[jira] [Commented] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

2022-10-21 Thread Daniel Darabos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622311#comment-17622311
 ] 

Daniel Darabos commented on SPARK-40873:


This works on the R side for dropping the metadata:

{{df$schema$metadata <- NULL}}

After that Spark sees my new columns!

I don't know if this is a Spark bug or an Arrow bug or a bug at all. Hopefully 
the next person to hit this problem finds this issue.
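
For anyone hitting this from Python instead of R, a rough pyarrow equivalent of the same workaround might look like this (a sketch only; the output file name is made up, and dropping *all* schema metadata is a blunt but simple choice):

{code:python}
# Sketch: strip the carried-over schema metadata so Spark falls back to the
# physical Parquet schema. 'part-0-clean.parquet' is a placeholder name.
import pyarrow.parquet as pq

table = pq.read_table('part-0.parquet')
print(table.schema.metadata)  # shows the stale Spark/Arrow key-value metadata

cleaned = table.replace_schema_metadata(None)  # None drops all schema-level metadata
pq.write_table(cleaned, 'part-0-clean.parquet')
{code}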







[jira] [Commented] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

2022-10-21 Thread Daniel Darabos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622251#comment-17622251
 ] 

Daniel Darabos commented on SPARK-40873:


Oh, I think I got it! With debug logging Spark prints a lot of stuff, including 
the metadata:

{{"keyValueMetaData" : {}}
{{      "ARROW:schema" : 
"/zACAAAQAAAKAA4ABgAFAAgACgABBAAQAAAKAAwEAAgACgAAACwBAAAEAgAAAOgEKP///wg0KQAAAG9yZy5hcGFjaGUuc3Bhcmsuc3FsLnBhcnF1ZXQucm93Lm1ldGFkYXRhlwAAAHsidHlwZSI6InN0cnVjdCIsImZpZWxkcyI6W3sibmFtZSI6Im5hbWUiLCJ0eXBlIjoic3RyaW5nIiwibnVsbGFibGUiOnRydWUsIm1ldGFkYXRhIjp7fX0seyJuYW1lIjoiYWdlIiwidHlwZSI6ImRvdWJsZSIsIm51bGxhYmxlIjp0cnVlLCJtZXRhZGF0YSI6e319XX0ACAAMAAQACAAICCQYb3JnLmFwYWNoZS5zcGFyay52ZXJzaW9uAAUzLjMuMAQAAACoZDQEeP///wAAAQMQGAQABQAAAGFnZV80qv///wAAAgCkAAABAxAYBAAFYWdlXzIAAADWAAACAND///8AAAEDEBwEAAMAAABhZ2UGAAgABgAGAAACABAAFAAIAAYABwAMEAAQAAABBRAcBAAEbmFtZQAEAAQABA==",}}
{{      "org.apache.spark.version" : "3.3.0",}}
{{      "org.apache.spark.sql.parquet.row.metadata" : 
"\{\"type\":\"struct\",\"fields\":[{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},\{\"name\":\"age\",\"type\":\"double\",\"nullable\":true,\"metadata\":{}}]}"}}
{{    },}}

This file is based on a Parquet file originally written by Spark; the "age_2" and
"age_4" columns were added in R. It looks like r-arrow carried the key-value
metadata over from the original file, so we still have an
"{{org.apache.spark.sql.parquet.row.metadata}}" entry that lists only "name" and
"age". Spark apparently trusts that embedded schema over the physical Parquet
schema, which would explain why it only sees those two columns.

I'll see if I can drop the metadata in r-arrow.
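
(For reference, the same key-value metadata can be inspected from Python with pyarrow. A sketch, assuming the attached file name; the key is the one shown in the log above:)

{code:python}
# Sketch: read the Parquet key-value metadata and confirm that the file still
# carries the old Spark schema listing only "name" and "age".
import json
import pyarrow.parquet as pq

kv = pq.read_metadata('part-0.parquet').metadata  # dict of bytes -> bytes
spark_schema = kv.get(b'org.apache.spark.sql.parquet.row.metadata')
if spark_schema is not None:
    fields = json.loads(spark_schema)['fields']
    print([f['name'] for f in fields])  # expected: ['name', 'age']
{code}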







[jira] [Created] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

2022-10-21 Thread Daniel Darabos (Jira)
Daniel Darabos created SPARK-40873:
--

 Summary: Spark doesn't see some Parquet columns written from 
r-arrow
 Key: SPARK-40873
 URL: https://issues.apache.org/jira/browse/SPARK-40873
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Daniel Darabos
 Attachments: part-0.parquet

I have a Parquet file that was created in R using the write_dataset() function
of the r-arrow package (version 9.0.0 from Conda Forge). It has four columns,
but Spark 3.3.0 only sees two of them.

{code:python}
>>> df = spark.read.parquet('part-0.parquet')
>>> df.head()
Row(name='Adam', age=20.0)
>>> df.columns
['name', 'age']
>>> import pandas as pd
>>> pd.read_parquet('part-0.parquet')
           name   age   age_2      age_4
0          Adam  20.0   400.0   160000.0
1           Eve  18.0   324.0   104976.0
2           Bob  50.0  2500.0  6250000.0
3  Isolated Joe   2.0     4.0       16.0
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> t = pq.read_table('part-0.parquet')
>>> t
pyarrow.Table
name: string
age: double
age_2: double
age_4: double
----
name: [["Adam","Eve","Bob","Isolated Joe"]]
age: [[20,18,50,2]]
age_2: [[400,324,2500,4]]
age_4: [[160000,104976,6250000,16]]
>>> pq.read_metadata('part-0.parquet')
<pyarrow._parquet.FileMetaData object at 0x...>
  created_by: parquet-cpp-arrow version 9.0.0
  num_columns: 4
  num_rows: 4
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 1510
>>> pq.read_metadata('part-0.parquet').schema
<pyarrow._parquet.ParquetSchema object at 0x...>
required group field_id=-1 schema {
  optional binary field_id=-1 name (String);
  optional double field_id=-1 age;
  optional double field_id=-1 age_2;
  optional double field_id=-1 age_4;
}
{code}

"age_2" and "age_4" look no different from "age" based on the schema. I tried 
changing the names (just letters) but I still get the same behavior.

Is something wrong with my file? Is something wrong with Spark?

(I'll attach the file in a minute, I just need to figure out how.)






[jira] [Updated] (SPARK-40873) Spark doesn't see some Parquet columns written from r-arrow

2022-10-21 Thread Daniel Darabos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-40873:
---
Attachment: part-0.parquet







[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2022-08-12 Thread Daniel Darabos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17578901#comment-17578901
 ] 

Daniel Darabos commented on SPARK-37690:


It's fixed in Spark 3.3.0. 
(https://github.com/apache/spark/commit/1d068cef38f2323967be83045118cef0e537e8dc)
 Does upgrading count as a workaround?

Or on 3.2 you can avoid the cycle error by saving the new table under a new 
name. 
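
A sketch of that 3.2 workaround in PySpark (view names here are made up; the point is just that each step registers a fresh name instead of replacing "df" with a query that reads from "df" itself):

{code:python}
# Sketch of the 3.2-era workaround: use a new temp view name per step.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)") \
     .createOrReplaceTempView("step_1")
spark.sql("SELECT * FROM step_1").createOrReplaceTempView("step_2")
spark.sql("SELECT * FROM step_2").show()  # no recursive-view error
{code}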

> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) {code}   
> The following error is now produced:   
> {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> 
> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0 since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly such that replacing an existing view should be allowed.   An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   






[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2022-01-17 Thread Daniel Darabos (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477141#comment-17477141
 ] 

Daniel Darabos commented on SPARK-37690:


We've hit this too with Spark 3.2.0. Could this be fallout from 
[SPARK-34546|https://issues.apache.org/jira/browse/SPARK-34546]? It changed 
exactly where the query for a view is analyzed, and it was added in 3.2.0. 
[~imback82], what do you think?

Here's a repro for the Scala Spark Shell:
{code:java}
scala> Seq((1, 2)).toDF.createOrReplaceTempView("x")
scala> spark.sql("select * from x").createOrReplaceTempView("x")
org.apache.spark.sql.AnalysisException: Recursive view `x` detected (cycle: `x` 
-> `x`)
{code}
 

 







[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-11-23 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697319#comment-16697319
 ] 

Daniel Darabos commented on SPARK-20144:


So where do we go from here? Should I try to find a reviewer?

> spark.read.parquet no long maintains ordering of the data
> -
>
> Key: SPARK-20144
> URL: https://issues.apache.org/jira/browse/SPARK-20144
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Li Jin
>Priority: Major
>
> Hi, We are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is 
> when we read parquet files in 2.0.2, the ordering of rows in the resulting 
> dataframe is not the same as the ordering of rows in the dataframe that the 
> parquet file was reproduced with. 
> This is because FileSourceStrategy.scala combines the parquet files into 
> fewer partitions and also reordered them. This breaks our workflows because 
> they assume the ordering of the data. 
> Is this considered a bug? Also FileSourceStrategy and FileSourceScanExec 
> changed quite a bit from 2.0.2 to 2.1, so not sure if this is an issue with 
> 2.1.






[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650817#comment-16650817
 ] 

Daniel Darabos commented on SPARK-20144:


Thanks, those are good questions.

# The global option is not great, but it's the simplest. The code is already 
controlled by two global options. ({{spark.sql.files.maxPartitionBytes}} and 
{{spark.sql.files.openCostInBytes}}.) Why not one more?
 # I'm not sure what {{LOAD DATA INPATH}} does. (Sorry...) But sure, users can 
put random-name files in the directory and mess stuff up. Best protection 
against that is not putting random-name files in the directory. :D
 # The whole problem is not Parquet-specific. It affects all file types. The 
{{part-00001}} naming comes from Hadoop's 
[FileOutputFormat|https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.java#L270].
 It's been like this forever and will never change. (I'd say it's more than a 
convention.)







[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650722#comment-16650722
 ] 

Daniel Darabos commented on SPARK-20144:


Yeah, I'm not too happy about the alphabetical ordering either. I thought I 
could simply not sort, and get the "original" order. But at the point where I 
made my change, the files are already in a jumbled order. Maybe it's the file 
system listing order, which could be anything.

99% of the time I'm just reading back a single partitioned Parquet file. In 
this case the alphabetical ordering is the right ordering. ({{part-00001}}, 
{{part-00002}}, ...) The rows of the resulting DataFrame will be in the same 
order as originally. So I think this issue is satisfied by the change. (The 
test also demonstrates this.)

The 1% case (for me) is when I'm reading back multiple Parquet files with a 
glob in a single {{spark.read.parquet("dir-\{0,5,10}")}} call. In this case it 
would be nice to respect the order given by the user ({{dir-0}}, {{dir-5}}, 
{{dir-10}}). My PR messes this up. ({{dir-0}}, {{dir-10}}, {{dir-5}}) But at 
least the partitions within each Parquet file will be contiguous. That's still 
an improvement.
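
To make the two cases concrete, here is a tiny illustration (plain Python, no Spark involved): zero-padded part file names sort lexicographically in write order, while the unpadded directory names from the glob example do not.

{code:python}
# Zero-padded part files: alphabetical order == write order.
parts = ["part-00010", "part-00002", "part-00001"]
print(sorted(parts))  # ['part-00001', 'part-00002', 'part-00010']

# Unpadded directory names from the glob example: alphabetical != the user's order.
dirs = ["dir-0", "dir-5", "dir-10"]
print(sorted(dirs))   # ['dir-0', 'dir-10', 'dir-5']
{code}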







[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-15 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650492#comment-16650492
 ] 

Daniel Darabos commented on SPARK-20144:


Thanks Victor! I've expanded the test with a case where reordering is allowed, 
and I've added some explanatory comments in the test.

[~dongjoon], what do you think? Should I try to foster more discussion? Or what 
could be a next step?







[jira] [Commented] (SPARK-20144) spark.read.parquet no long maintains ordering of the data

2018-10-08 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16642401#comment-16642401
 ] 

Daniel Darabos commented on SPARK-20144:


Sorry, I had an idea for a quick fix for this and sent a pull request without 
discussing it first. Let me copy the rationale from the PR:

I'm adding {{spark.sql.files.allowReordering}}, defaulting to {{true}}. When 
set to {{true}} the behavior is as before. When set to {{false}}, the input 
files are read in alphabetical order. This means partitions are read in 
{{part-00001}}, {{part-00002}}, {{part-00003}}... order, recovering the 
ordering the data was written with.
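
For illustration, usage would look roughly like this in PySpark. Note that {{spark.sql.files.allowReordering}} exists only in the (unmerged) pull request, not in released Spark, and the path below is a placeholder:

{code:python}
# Sketch of the proposed flag in use; not available in released Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.files.allowReordering", "false")  # proposed option only
df = spark.read.parquet("sorted-output.parquet")            # placeholder path
# With the flag off, part files would be read in part-00001, part-00002, ...
# order, so the DataFrame comes back in the order the data was written.
{code}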

While *SPARK-20144* has been closed as "Not A Problem", I think this is still a 
valuable feature. Spark has been 
[touted|https://databricks.com/blog/2016/11/14/setting-new-world-record-apache-spark.html]
 as the best tool for sorting. It certainly can sort data. But without this 
change, it cannot read back sorted data through the DataFrame API.

My practical use case is that we allow users to run their SQL expressions 
through our UI. We also allow them to ask for the results to be persisted to 
Parquet files. We noticed that if they do an {{ORDER BY}}, the ordering is lost 
if they also ask for persistence. For example they might want to rank data 
points by a score, so they can later get the top 10 or top 10,000,000 entries 
easily. With this change we could fulfill this use case.

The fix is small and safe. (25 lines including test and docs, only changes 
behavior when new flag is set.) Is there a reason not to do this?







[jira] [Commented] (SPARK-23207) Shuffle+Repartition on an DataFrame could lead to incorrect answers

2018-08-23 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590435#comment-16590435
 ] 

Daniel Darabos commented on SPARK-23207:


Sorry, could you clarify the fix version please? "Affects version" and "Fix 
version" are both set to 2.3.0. And GitHub only shows 
[https://github.com/apache/spark/commit/94c67a76ec1fda908a671a47a2a1fa63b3ab1b06]
 on master, not on a release tag. Thanks! (Also thanks for the fix.)

> Shuffle+Repartition on an DataFrame could lead to incorrect answers
> ---
>
> Key: SPARK-23207
> URL: https://issues.apache.org/jira/browse/SPARK-23207
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>Assignee: Jiang Xingbo
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Currently shuffle repartition uses RoundRobinPartitioning; the generated 
> result is nondeterministic since the sequence of input rows is not 
> determined.
> The bug can be triggered when there is a repartition call following a shuffle 
> (which would lead to non-deterministic row ordering), as the pattern shows 
> below:
> upstream stage -> repartition stage -> result stage
> (-> indicate a shuffle)
> When one of the executors process goes down, some tasks on the repartition 
> stage will be retried and generate inconsistent ordering, and some tasks of 
> the result stage will be retried generating different data.
> The following code returns 931532, instead of 1000000:
> {code}
> import scala.sys.process._
> import org.apache.spark.TaskContext
> val res = spark.range(0, 1000 * 1000, 1).repartition(200).map { x =>
>   x
> }.repartition(200).map { x =>
>   if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 2) {
> throw new Exception("pkill -f java".!!)
>   }
>   x
> }
> res.distinct().count()
> {code}






[jira] [Commented] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Daniel Darabos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16584015#comment-16584015
 ] 

Daniel Darabos commented on SPARK-25146:


Wonderful, thanks! Sorry I missed the fix.







[jira] [Updated] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Daniel Darabos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-25146:
---
Description: 
We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
them. The average in some cases comes out to {{null}} to our surprise (and 
disappointment).

After a bit of digging it looks like these numbers have ended up with the 
{{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
this type:

{code}
scala> (1 to 10000).map(_*0.001).toDF.createOrReplaceTempView("x")

scala> spark.sql("select cast(value as decimal(37, 30)) as v from x").createOrReplaceTempView("x")

scala> spark.sql("select avg(v) from x").show
+------+
|avg(v)|
+------+
|  null|
+------+
{code}

For up to 4471 numbers it is able to calculate the average. For 4472 or more 
numbers it's {{null}}.

Now I'll just change these numbers to {{double}}. But we got the types entirely 
automatically. We never asked for {{decimal}}. If this is the default type, 
it's important to support averaging a handful of them. (Sorry for the 
bitterness. I like {{double}} more. :))

Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
that {{avg()}} fails.

  was:
We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
them. The average in some cases comes out to {{null}} to our surprise (and 
disappointment).

After a bit of digging it looks like these numbers have ended up with the 
{{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
this type:

{{scala> (1 to 10000).map(_*0.001).toDF.createOrReplaceTempView("x")}}

{{scala> spark.sql("select cast(value as decimal(37, 30)) as v from 
x").createOrReplaceTempView("x")}}

{{scala> spark.sql("select avg(v) from x").show}}

{{+------+}}
{{|avg(v)|}}
{{+------+}}
{{|  null|}}
{{+------+}}

For up to 4471 numbers it is able to calculate the average. For 4472 or more 
numbers it's {{null}}.

Now I'll just change these numbers to {{double}}. But we got the types entirely 
automatically. We never asked for {{decimal}}. If this is the default type, 
it's important to support averaging a handful of them. (Sorry for the 
bitterness. I like {{double}} more. :))

Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
that {{avg()}} fails.








[jira] [Created] (SPARK-25146) avg() returns null on some decimals

2018-08-17 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-25146:
--

 Summary: avg() returns null on some decimals
 Key: SPARK-25146
 URL: https://issues.apache.org/jira/browse/SPARK-25146
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Daniel Darabos


We compute some 0-10 numbers in a pipeline using Spark SQL. Then we average 
them. The average in some cases comes out to {{null}} to our surprise (and 
disappointment).

After a bit of digging it looks like these numbers have ended up with the 
{{decimal(37,30)}} type. I've got a Spark Shell (2.3.0 and 2.3.1) repro with 
this type:

{{scala> (1 to 10000).map(_*0.001).toDF.createOrReplaceTempView("x")}}

{{scala> spark.sql("select cast(value as decimal(37, 30)) as v from 
x").createOrReplaceTempView("x")}}

{{scala> spark.sql("select avg(v) from x").show}}

{{+------+}}
{{|avg(v)|}}
{{+------+}}
{{|  null|}}
{{+------+}}

For up to 4471 numbers it is able to calculate the average. For 4472 or more 
numbers it's {{null}}.

Now I'll just change these numbers to {{double}}. But we got the types entirely 
automatically. We never asked for {{decimal}}. If this is the default type, 
it's important to support averaging a handful of them. (Sorry for the 
bitterness. I like {{double}} more. :))

Curiously, {{sum()}} works. And {{count()}} too. So it's quite the surprise 
that {{avg()}} fails.
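
(A PySpark sketch of the "change these numbers to double" workaround; the data is synthetic and just mirrors the repro above, it is not the original pipeline:)

{code:python}
# Sketch: casting the decimal(37,30) column to double before averaging avoids
# the null result. Synthetic data mirroring the repro above.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 10001).select(
    (F.col("id") * 0.001).cast("decimal(37,30)").alias("v"))

df.select(F.avg(F.col("v").cast("double"))).show()  # works
# df.select(F.avg("v")).show()  # on 2.3.x this came back as null
{code}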






[jira] [Commented] (SPARK-23666) Undeterministic column name with UDFs

2018-03-13 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396832#comment-16396832
 ] 

Daniel Darabos commented on SPARK-23666:


I've looked at the code and both {{ScalaUDF.scala}} and 
{{mathExpressions.scala}} just call {{toString}} on an {{Expression}} child. I 
don't see why the ID is added in one case and not the other...
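
(As a user-level workaround rather than a fix: giving the UDF result an explicit alias makes the column name, and therefore the schema, deterministic. A PySpark sketch with made-up data:)

{code:python}
# Workaround sketch, not a fix for the underlying toString behavior:
# alias the UDF result so the auto-generated name (with its internal ID)
# never ends up in the schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([((1, 2), 3)], ["a", "b"]).createOrReplaceTempView("x")
spark.udf.register("f", lambda v: v, "long")

s1 = spark.sql("SELECT f(a._1) AS c FROM x").schema
s2 = spark.sql("SELECT f(a._1) AS c FROM x").schema
print(s1 == s2)  # True: the explicit alias gives a stable column name
{code}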







[jira] [Created] (SPARK-23666) Undeterministic column name with UDFs

2018-03-13 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-23666:
--

 Summary: Undeterministic column name with UDFs
 Key: SPARK-23666
 URL: https://issues.apache.org/jira/browse/SPARK-23666
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.0
Reporter: Daniel Darabos


When you access structure fields in Spark SQL, the auto-generated result column 
name includes an internal ID.
{code:java}
scala> import spark.implicits._
scala> Seq(((1, 2), 3)).toDF("a", "b").createOrReplaceTempView("x")
scala> spark.udf.register("f", (a: Int) => a)
scala> spark.sql("select f(a._1) from x").show
+-+
|UDF:f(a._1 AS _1#148)|
+-+
|1|
+-+
{code}
This ID ({{#148}}) is only included for UDFs.
{code:java}
scala> spark.sql("select factorial(a._1) from x").show
+---+
|factorial(a._1 AS `_1`)|
+---+
|  1|
+---+
{code}
The internal ID is different on every invocation. The problem this causes for 
us is that the schema of the SQL output is never the same:
{code:java}
scala> spark.sql("select f(a._1) from x").schema ==
   spark.sql("select f(a._1) from x").schema
Boolean = false
{code}
We rely on similar schema checks when reloading persisted data.






[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get in DataSourceScanExec with sun.io.serialization.extendedDebugInfo=true

2017-09-14 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165946#comment-16165946
 ] 

Daniel Darabos commented on SPARK-21418:


Sean's fix should cover you no matter what triggers the unexpected {{toString}} 
call. You could try building from {{master}} (or taking a nightly from 
https://spark.apache.org/developer-tools.html#nightly-builds) to confirm that 
this is the case.

> NoSuchElementException: None.get in DataSourceScanExec with 
> sun.io.serialization.extendedDebugInfo=true
> ---
>
> Key: SPARK-21418
> URL: https://issues.apache.org/jira/browse/SPARK-21418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Daniel Darabos
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> I don't have a minimal reproducible example yet, sorry. I have the following 
> lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
> fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
>   at 
> org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
>   at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
>   

[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd

2017-09-04 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16152742#comment-16152742
 ] 

Daniel Darabos commented on SPARK-21418:


Sorry for the delay. I can confirm that removing 
{{-Dsun.io.serialization.extendedDebugInfo=true}} is the fix. We only use this 
flag when running unit tests, but it's very useful for debugging serialization 
issues. In Spark it often happens that you accidentally include something in a 
closure that cannot be serialized, and without this flag it is hard to figure 
out what caused it.


[jira] [Comment Edited] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd

2017-09-04 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088952#comment-16088952
 ] 

Daniel Darabos edited comment on SPARK-21418 at 9/4/17 3:49 PM:


I'm on holiday without a computer through the coming week, but I'll try to
dig deeper after that.

I do recall that we enable a JVM flag for printing extra details on
serialization errors. Now I wonder if that flag collects string forms even
when no error happens. I guess I should not be surprised: if it did not,
there would be no reason to ever disable this feature.

That already suggests an easy workaround :). Thanks!


was (Author: darabos):
I'm on holiday without a computer through the coming week, but I'll try to
dig deeper after that.

I do recall that we enable a JVM flag for printing extra details on
serialization errors. Now I wonder if that flag collects string forms even
when no error happens. I guess I should not be surprised: if it did not,
there would be no reason to ever disable this feature.

That already suggests an easy workaround :). Thanks!

On Jul 15, 2017 6:44 PM, "Kazuaki Ishizaki (JIRA)"  wrote:


[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088659#comment-16088659 ]

Kazuaki Ishizaki commented on SPARK-21418:
--

I am curious why {java.io.ObjectOutputStream.writeOrdinaryObject} calls
`toString` method. Do you specify some option to run this program for JVM?

[quoted issue text and a truncated serialization stack trace elided]

[...] from https://issues.apache.org/jira/browse/SPARK-20070. It tries to
redact sensitive information from {{explain}} output. (We are not trying to
explain anything here, so I doubt it is meant to be running in this case.)
When it needs to access some configuration, it tries to take it from the
"current" Spark session, which it just reads from a thread-local variable.
We appear to be on a thread where this variable is not set. [...] constraint
on multi-threaded Spark applications.



[jira] [Commented] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd

2017-07-16 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088952#comment-16088952
 ] 

Daniel Darabos commented on SPARK-21418:


I'm on holiday without a computer through the coming week, but I'll try to
dig deeper after that.

I do recall that we enable a JVM flag for printing extra details on
serialization errors. Now I wonder if that flag collects string forms even
when no error happens. I guess I should not be surprised: if it did not,
there would be no reason to ever disable this feature.

That already suggests an easy workaround :). Thanks!

On Jul 15, 2017 6:44 PM, "Kazuaki Ishizaki (JIRA)"  wrote:


[ https://issues.apache.org/jira/browse/SPARK-21418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16088659#comment-16088659 ]

Kazuaki Ishizaki commented on SPARK-21418:
--

I am curious why {java.io.ObjectOutputStream.writeOrdinaryObject} calls
`toString` method. Do you specify some option to run this program for JVM?

[The quoted stack trace from the issue description is truncated here; the full trace appears in the issue text below.]

The {{redact}} call was introduced by commit
91fa80fe8a2480d64c430bd10f97b3d44c007bcc#diff-2a91a9a59953aa82fa132aaf45bd731bR69
from https://issues.apache.org/jira/browse/SPARK-20070. It tries to
redact sensitive information from {{explain}} output. (We are not trying to
explain anything here, so I doubt it is meant to be running in this case.)
When it needs to access some configuration, it tries to take it from the
"current" Spark session, which it just reads from a thread-local variable.
We appear to be on a thread where this variable is not set. That seems like an
unexpected constraint on multi-threaded Spark applications.
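A sketch of a per-thread mitigation, assuming the missing piece really is just the active session on the thread that serializes the plan (untested against this exact failure):

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical helper: make the given session the "current" one on whatever
// thread is about to touch query plans, so code that reads the thread-local
// active session (like the redaction above) can find a configuration.
def withActiveSession[T](spark: SparkSession)(body: => T): T = {
  SparkSession.setActiveSession(spark)
  try body finally SparkSession.clearActiveSession()
}
{code}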



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


> NoSuchElementException: None.get on DataFrame.rdd
> -
>
> Key: SPARK-21418
> URL: https://issues.apache.org/jira/browse/SPARK-21418
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Daniel Darabos
>
> I don't have a minimal reproducible example yet, sorry. I have the following 
> lines in a unit test for our Spark application:
> {code}
> val df = mySparkSession.read.format("jdbc")
>   .options(Map("url" -> url, "dbtable" -> "test_table"))
>   .load()
> df.show
> println(df.rdd.collect)
> {code}
> The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
> fails:
> {noformat}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> serialization failed: java.util.NoSuchElementException: None.get
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
>   at 
> 

[jira] [Commented] (SPARK-20070) Redact datasource explain output

2017-07-14 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16087378#comment-16087378
 ] 

Daniel Darabos commented on SPARK-20070:


I think this change has broken our application. Your input on 
https://issues.apache.org/jira/browse/SPARK-21418 would be greatly appreciated. 
Thanks!

> Redact datasource explain output
> 
>
> Key: SPARK-20070
> URL: https://issues.apache.org/jira/browse/SPARK-20070
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
> Fix For: 2.2.0
>
>
> When calling explain on a datasource, the output can contain sensitive 
> information. We should provide an admin/user to redact such information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21418) NoSuchElementException: None.get on DataFrame.rdd

2017-07-14 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-21418:
--

 Summary: NoSuchElementException: None.get on DataFrame.rdd
 Key: SPARK-21418
 URL: https://issues.apache.org/jira/browse/SPARK-21418
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Daniel Darabos


I don't have a minimal reproducible example yet, sorry. I have the following 
lines in a unit test for our Spark application:

{code}
val df = mySparkSession.read.format("jdbc")
  .options(Map("url" -> url, "dbtable" -> "test_table"))
  .load()
df.show
println(df.rdd.collect)
{code}

The output shows the DataFrame contents from {{df.show}}. But the {{collect}} 
fails:

{noformat}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 
serialization failed: java.util.NoSuchElementException: None.get
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at 
org.apache.spark.sql.execution.DataSourceScanExec$class.org$apache$spark$sql$execution$DataSourceScanExec$$redact(DataSourceScanExec.scala:70)
  at 
org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:54)
  at 
org.apache.spark.sql.execution.DataSourceScanExec$$anonfun$4.apply(DataSourceScanExec.scala:52)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.AbstractTraversable.map(Traversable.scala:104)
  at 
org.apache.spark.sql.execution.DataSourceScanExec$class.simpleString(DataSourceScanExec.scala:52)
  at 
org.apache.spark.sql.execution.RowDataSourceScanExec.simpleString(DataSourceScanExec.scala:75)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:349)
  at 
org.apache.spark.sql.execution.RowDataSourceScanExec.org$apache$spark$sql$execution$DataSourceScanExec$$super$verboseString(DataSourceScanExec.scala:75)
  at 
org.apache.spark.sql.execution.DataSourceScanExec$class.verboseString(DataSourceScanExec.scala:60)
  at 
org.apache.spark.sql.execution.RowDataSourceScanExec.verboseString(DataSourceScanExec.scala:75)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:556)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec.generateTreeString(WholeStageCodegenExec.scala:451)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:576)
  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:480)
  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:477)
  at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:474)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1421)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at 
scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
  at 

[jira] [Created] (SPARK-21136) Misleading error message for typo in SQL

2017-06-19 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-21136:
--

 Summary: Misleading error message for typo in SQL
 Key: SPARK-21136
 URL: https://issues.apache.org/jira/browse/SPARK-21136
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Daniel Darabos
Priority: Minor


{code}
scala> spark.sql("select * from a left joinn b on a.id = b.id").show
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'from' expecting {<EOF>, 'WHERE', 'GROUP', 'ORDER', 'HAVING', 
'LIMIT', 'LATERAL', 'WINDOW', 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'SORT', 
'CLUSTER', 'DISTRIBUTE'}(line 1, pos 9)

== SQL ==
select * from a left joinn b on a.id = b.id
---------^^^
{code}

The issue is that {{^^^}} points at {{from}}, not at {{joinn}}. The text of the 
error makes no sense either. If {{*}}, {{a}}, and {{b}} are complex in 
themselves, a misleading error like this can hinder debugging substantially.

I tried to see if maybe I could fix this. Am I correct to deduce that the error 
message originates in ANTLR4, which parses the query based on the syntax 
defined in {{SqlBase.g4}}? If so, I guess I would have to figure out how that 
syntax definition works, and why it misattributes the error.
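For anyone digging into this, the parser can be poked at in isolation (a sketch using the internal {{CatalystSqlParser}}, which is not a stable public API):

{code}
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

// Reproduce the ParseException without a SparkSession and inspect the
// position information the ANTLR error listener attached to it.
try {
  CatalystSqlParser.parsePlan("select * from a left joinn b on a.id = b.id")
} catch {
  case e: ParseException =>
    println(s"line=${e.line}, startPosition=${e.startPosition}")
    println(e.getMessage)
}
{code}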



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19209) "No suitable driver" on first try

2017-01-16 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824224#comment-15824224
 ] 

Daniel Darabos edited comment on SPARK-19209 at 1/16/17 4:26 PM:
-

Sorry, Jira had some issues when I was trying to file this issue; I guess that 
resulted in the duplicates. It also says I'm watching the issue, but I got no 
mail about your comments. (I checked my spam folder.) I have now unwatched and 
re-watched it; hope that helps.

Yes, it is the same if the table exists:

{code}
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{code}

If I specify the driver class ({{.option("driver", "org.sqlite.JDBC")}}) then 
there is no problem: the method works on the first try. Subsequent tries work 
even if the driver is not specified.

This is not a silver bullet, as our JDBC path typically comes from an external 
source (i.e. the user). But this is definitely a workaround when working in the 
shell. Thanks!
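Spelled out, the shell workaround looks roughly like this (same test database and table as above):

{code}
// Workaround sketch: name the JDBC driver class explicitly so the very first
// load() succeeds; without the "driver" option only the second attempt works.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlite:testdb")
  .option("dbtable", "y")
  .option("driver", "org.sqlite.JDBC")
  .load()
df.show()
{code}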


was (Author: darabos):
Sorry, Jira had some issues when I was trying to file this issue; I guess it 
resulted in the duplicates. Also it says I'm watching the issue, but I got no 
mail about your comments. (I checked my spam folder.) Well I have unwatched and 
watched it now, hope it helps.

Yes, it is the same if the table exists:

{verbatim}
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{verbatim}

If I specify the driver class ({{.option("driver", "org.sqlite.JDBC")}}) then 
there is no problem: the method works on the first try. Subsequent tries work 
even if the driver is not specified.

This is not a silver bullet, as our JDBC path typically comes from an external 
source (i.e. the user). But this is definitely a workaround when working in the 
shell. Thanks!

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> 

[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-16 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824224#comment-15824224
 ] 

Daniel Darabos commented on SPARK-19209:


Sorry, Jira had some issues when I was trying to file this issue; I guess it 
resulted in the duplicates. Also it says I'm watching the issue, but I got no 
mail about your comments. (I checked my spam folder.) Well I have unwatched and 
watched it now, hope it helps.

Yes, it is the same if the table exists:

{verbatim}
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:testdb").option("dbtable", "y").load.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{verbatim}

If I specify the driver class ({{.option("driver", "org.sqlite.JDBC")}}) then 
there is no problem: the method works on the first try. Subsequent tries work 
even if the driver is not specified.

This is not a silver bullet, as our JDBC path typically comes from an external 
source (i.e. the user). But this is definitely a workaround when working in the 
shell. Thanks!

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
> --driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org

[jira] [Updated] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-19209:
---
Description: 
This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
--driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars org.xerial.sqlite-jdbc-3.8.11.2.jar 
--driver-class-path org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.

  was:
This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.


> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>

[jira] [Commented] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15821578#comment-15821578
 ] 

Daniel Darabos commented on SPARK-19209:


Puzzlingly this only happens in the application when the SparkSession is 
created with {{enableHiveSupport}}. I guess in {{spark-shell}} it is enabled by 
default.
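For context, a sketch of the session setup in question (names are illustrative):

{code}
import org.apache.spark.sql.SparkSession

// In the application the error only appeared with the Hive-enabled setup;
// spark-shell presumably enables Hive support by default when Spark is built
// with it, which would explain why the shell repro also fails on first try.
val spark = SparkSession.builder()
  .appName("app")
  .enableHiveSupport()   // without this line the first load() reportedly worked
  .getOrCreate()
{code}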

> "No suitable driver" on first try
> -
>
> Key: SPARK-19209
> URL: https://issues.apache.org/jira/browse/SPARK-19209
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Daniel Darabos
>
> This is a regression from Spark 2.0.2. Observe!
> {code}
> $ ~/spark-2.0.2/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> This is the "good" exception. Now with Spark 2.1.0:
> {code}
> $ ~/spark-2.1.0/bin/spark-shell --jars 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
> stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
> [...]
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: No suitable driver
>   at java.sql.DriverManager.getDriver(DriverManager.java:315)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   ... 48 elided
> scala> spark.read.format("jdbc").option("url", 
> "jdbc:sqlite:").option("dbtable", "x").load
> java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
> table: x)
> {code}
> Simply re-executing the same command a second time "fixes" the {{No suitable 
> driver}} error.
> My guess is this is fallout from https://github.com/apache/spark/pull/15292 
> which changed the JDBC driver management code. But this code is so hard to 
> understand for me, I could be totally wrong.
> This is nothing more than a nuisance for {{spark-shell}} usage, but it is 
> more painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19209) "No suitable driver" on first try

2017-01-13 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-19209:
--

 Summary: "No suitable driver" on first try
 Key: SPARK-19209
 URL: https://issues.apache.org/jira/browse/SPARK-19209
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0
Reporter: Daniel Darabos


This is a regression from Spark 2.0.2. Observe!

{code}
$ ~/spark-2.0.2/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

This is the "good" exception. Now with Spark 2.1.0:

{code}
$ ~/spark-2.1.0/bin/spark-shell --jars 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar --driver-class-path 
stage/lib/org.xerial.sqlite-jdbc-3.8.11.2.jar
[...]
scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: No suitable driver
  at java.sql.DriverManager.getDriver(DriverManager.java:315)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions$$anonfun$7.apply(JDBCOptions.scala:84)
  at scala.Option.getOrElse(Option.scala:121)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:83)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:34)
  at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
  at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
  ... 48 elided

scala> spark.read.format("jdbc").option("url", 
"jdbc:sqlite:").option("dbtable", "x").load
java.sql.SQLException: [SQLITE_ERROR] SQL error or missing database (no such 
table: x)
{code}

Simply re-executing the same command a second time "fixes" the {{No suitable 
driver}} error.

My guess is this is fallout from https://github.com/apache/spark/pull/15292 
which changed the JDBC driver management code. But this code is so hard to 
understand for me, I could be totally wrong.

This is nothing more than a nuisance for {{spark-shell}} usage, but it is more 
painful to work around for applications.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16625) Oracle JDBC table creation fails with ORA-00902: invalid datatype

2016-07-19 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-16625:
--

 Summary: Oracle JDBC table creation fails with ORA-00902: invalid 
datatype
 Key: SPARK-16625
 URL: https://issues.apache.org/jira/browse/SPARK-16625
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.2
Reporter: Daniel Darabos


Unfortunately I know very little about databases, but I figure this is a bug.

I have a DataFrame with the following schema: 
{noformat}
StructType(StructField(dst,StringType,true), StructField(id,LongType,true), 
StructField(src,StringType,true))
{noformat}

I am trying to write it to an Oracle database like this:

{code:java}
String url = "jdbc:oracle:thin:root/rootroot@:1521:db";
java.util.Properties p = new java.util.Properties();
p.setProperty("driver", "oracle.jdbc.OracleDriver");
df.write().mode("overwrite").jdbc(url, "my_table", p);
{code}

And I get:

{noformat}
Exception in thread "main" java.sql.SQLSyntaxErrorException: ORA-00902: invalid 
datatype

at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1108)
at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:541)
at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:264)
at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:598)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:213)
at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:26)
at 
oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1241)
at 
oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1558)
at 
oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:2498)
at 
oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:2431)
at 
oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:975)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:302)
{noformat}

The Oracle server I am running against is the one I get on Amazon RDS for 
engine type {{oracle-se}}. The same code (with the right driver) against the 
RDS instance with engine type {{MySQL}} works.

The error message is the same as in 
https://issues.apache.org/jira/browse/SPARK-12941. Could it be that {{Long}} is 
also translated into the wrong data type? Thanks.
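If the {{Long}} mapping is indeed the problem (as it was for other types in SPARK-12941), one possible workaround is registering a custom JDBC dialect that maps the offending Catalyst types to Oracle-friendly ones. A sketch, not verified against this Oracle setup:

{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hypothetical dialect override: LongType would otherwise be emitted as BIGINT,
// which Oracle rejects; NUMBER(19) and VARCHAR2 are common substitutes.
object OracleWriteDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case LongType   => Some(JdbcType("NUMBER(19)", java.sql.Types.NUMERIC))
    case StringType => Some(JdbcType("VARCHAR2(4000)", java.sql.Types.VARCHAR))
    case _          => None
  }
}

JdbcDialects.registerDialect(OracleWriteDialect)
{code}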



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12964) SparkContext.localProperties leaked

2016-07-15 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-12964:
---
Description: 
I have a non-deterministic but quite reliable reproduction for a case where 
{{spark.sql.execution.id}} is leaked. Operations then die with 
{{spark.sql.execution.id is already set}}. These threads never recover because 
there is nothing to unset {{spark.sql.execution.id}}. (It's not a case of 
nested {{withNewExecutionId}} calls.)

I have figured out why this happens. We are within a {{withNewExecutionId}} 
block. At some point we call back to user code. (In our case this is a custom 
data source's {{buildScan}} method.) The user code calls 
{{scala.concurrent.Await.result}}. Because our thread is a member of a 
{{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts a 
new thread. {{SparkContext.localProperties}} is cloned for this thread and then 
it's ready to serve an HTTP request.

The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
{{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} 
block. All good. But some time later another HTTP request will be served by the 
second thread. This thread is "poisoned" with a {{spark.sql.execution.id}}. 
When it tries to use {{withNewExecutionId}} it fails.
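(As a stopgap, a thread that got poisoned this way could clear the inherited property before starting new SQL work; a sketch, assuming the stale {{spark.sql.execution.id}} is the only leftover:)

{code}
// Hypothetical cleanup for a thread that inherited a stale execution id:
// setLocalProperty(key, null) removes the key from this thread's copy of
// SparkContext.localProperties.
def clearInheritedExecutionId(sc: org.apache.spark.SparkContext): Unit = {
  val key = "spark.sql.execution.id"
  if (sc.getLocalProperty(key) != null) {
    sc.setLocalProperty(key, null)
  }
}
{code}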



I don't know who's at fault here. 

 - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
Execution object and let it wrap the operation? Then you could have two 
executions in parallel on the same thread, and other stuff like that. It would 
be much clearer than storing the execution ID in a kind-of-global variable.
 - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there is 
a good reason, but this is essentially a bug-generator in my view. (It has 
already generated https://issues.apache.org/jira/browse/SPARK-10563.)
 - {{Await.result}} --- I never would have thought it starts threads.
 - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
 - We probably shouldn't call Spark things from HTTP serving threads.

I'm not sure what could be done on the Spark side, but I thought I should 
mention this interesting issue. For supporting evidence here is the stack trace 
when {{localProperties}} is getting cloned. It's contents at that point are:

{noformat}
{spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
spark.rdd.scope={"id":"4","name":"ExecutedCommand"}}
{noformat}

{noformat}
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) 
[na:1.7.0_91]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) 
[na:1.7.0_91]
  at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
[na:1.7.0_91]   
  at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91]   

  at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91]   

  at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91] 

  at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at scala.concurrent.Await$.result(package.scala:107) 
[org.scala-lang.scala-library-2.10.5.jar:na] 
  at 
com.lynxanalytics.biggraph.graph_api.SafeFuture.awaitResult(SafeFuture.scala:50)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.graph_api.DataManager.get(DataManager.scala:315) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.graph_api.Scripting$.getData(Scripting.scala:87) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 [biggraph.jar]
  at 

[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318523#comment-15318523
 ] 

Daniel Darabos commented on SPARK-15796:


> The only argument against it was that it's specific to the OpenJDK default.

I think Gabor has only tested with OpenJDK, but the default for {{NewRatio}} is 
the same in Oracle Java 8 Server JVM according to 
https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html.

> I think this issue still exists even with the fraction set to 0.66, because 
> of course if you are using any memory at all for other stuff, some of that 
> can't fit in the old generation. There will always be some need to tune GC 
> params when that becomes the bottleneck.

Good point. Maybe 0.6 would be the best default? If everything fit in old-gen 
in 1.5, it would probably still fit in the old-gen that way.

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * http://spark.apache.org/docs/1.5.2/configuration.html even says that it 
> shouldn't be bigger than the size of the old generation.
> * On the other hand, OpenJDK's default NewRatio is 2, which means an old 
> generation size of 66%. Hence the default value in Spark 1.6 contradicts this 
> advice.
> http://spark.apache.org/docs/1.6.1/tuning.html recommends that if the old 
> generation is running close to full, then setting 
> spark.memory.storageFraction to a lower value should help. I have tried with 
> spark.memory.storageFraction=0.1, but it still doesn't fix the issue. This is 
> not a surprise: http://spark.apache.org/docs/1.6.1/configuration.html 
> explains that storageFraction is not an upper-limit but a lower limit-like 
> thing on the size of Spark's 

[jira] [Commented] (SPARK-15796) Spark 1.6 default memory settings can cause heavy GC when caching

2016-06-07 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15318486#comment-15318486
 ] 

Daniel Darabos commented on SPARK-15796:


The example program takes less than a minute on Spark 1.5 and 5 minutes on 
Spark 1.6, using the default configuration in both cases. In neither case do we 
run out of memory.

The old generation size defaults to 66% and Spark caching in Spark 1.5 defaults 
to 60%, so with default settings the cache fits in the old generation in 1.5. 
But in 1.6 the default cache size is increased to 75% so it no longer fits in 
the old generation. This kills performance. (And the regression is very hard to 
debug. Kudos to Gabor Feher!)

The defaults changed in Spark 1.6 in a way that causes a 5x slowdown, and the 
documentation for the current settings does not mention this. Only the 
documentation for the deprecated {{spark.storage.memoryFraction}} mentions the 
issue, but its default value had been chosen so that the issue was not 
triggered by default. This should be documented for the new settings as well.

Unless someone never uses cache, they are going to hit this issue if they run 
with the default settings. I think this is bad enough to warrant changing the 
defaults. I propose defaulting {{spark.memory.fraction}} to 0.6. If someone 
wants to set {{spark.memory.fraction}} to 0.75 they need to also set 
{{-XX:NewRatio=3}} to avoid GC thrashing. (Another option is to set 
{{-XX:NewRatio=3}} by default, but I think it's a vendor-specific flag.)

What is the argument against defaulting {{spark.memory.fraction}} to 0.6?
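To make the proposal concrete, a sketch of the two ways to keep the cache inside the old generation (values as discussed in this thread):

{code}
import org.apache.spark.SparkConf

// Option 1: cap the unified memory region at 60% so it fits in the default
// old generation (~66% of the heap with the JVM default -XX:NewRatio=2).
val conf = new SparkConf()
  .setAppName("Cache Demo Application")
  .set("spark.memory.fraction", "0.6")

// Option 2: keep spark.memory.fraction at 0.75 but enlarge the old generation,
// e.g. spark-submit --driver-java-options "-XX:NewRatio=3" ...
{code}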

> Spark 1.6 default memory settings can cause heavy GC when caching
> -
>
> Key: SPARK-15796
> URL: https://issues.apache.org/jira/browse/SPARK-15796
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gabor Feher
>Priority: Minor
>
> While debugging performance issues in a Spark program, I've found a simple 
> way to slow down Spark 1.6 significantly by filling the RDD memory cache. 
> This seems to be a regression, because setting 
> "spark.memory.useLegacyMode=true" fixes the problem. Here is a repro that is 
> just a simple program that fills the memory cache of Spark using a 
> MEMORY_ONLY cached RDD (but of course this comes up in more complex 
> situations, too):
> {code}
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.storage.StorageLevel
> object CacheDemoApp { 
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("Cache Demo Application")   
> 
> val sc = new SparkContext(conf)
> val startTime = System.currentTimeMillis()
>   
> 
> val cacheFiller = sc.parallelize(1 to 5, 1000)
> 
>   .mapPartitionsWithIndex {
> case (ix, it) =>
>   println(s"CREATE DATA PARTITION ${ix}") 
> 
>   val r = new scala.util.Random(ix)
>   it.map(x => (r.nextLong, r.nextLong))
>   }
> cacheFiller.persist(StorageLevel.MEMORY_ONLY)
> cacheFiller.foreach(identity)
> val finishTime = System.currentTimeMillis()
> val elapsedTime = (finishTime - startTime) / 1000
> println(s"TIME= $elapsedTime s")
>   }
> }
> {code}
> If I call it the following way, it completes in around 5 minutes on my 
> Laptop, while often stopping for slow Full GC cycles. I can also see with 
> jvisualvm (Visual GC plugin) that the old generation of JVM is 96.8% filled.
> {code}
> sbt package
> ~/spark-1.6.0/bin/spark-submit \
>   --class "CacheDemoApp" \
>   --master "local[2]" \
>   --driver-memory 3g \
>   --driver-java-options "-XX:+PrintGCDetails" \
>   target/scala-2.10/simple-project_2.10-1.0.jar
> {code}
> If I add any one of the below flags, then the run-time drops to around 40-50 
> seconds and the difference is coming from the drop in GC times:
>   --conf "spark.memory.fraction=0.6"
> OR
>   --conf "spark.memory.useLegacyMode=true"
> OR
>   --driver-java-options "-XX:NewRatio=3"
> All the other cache types except for DISK_ONLY produce similar symptoms. It 
> looks like the problem is that the amount of data Spark wants to store 
> long-term ends up being larger than the old generation size in the JVM, and 
> this triggers Full GC repeatedly.
> I did some research:
> * In Spark 1.6, spark.memory.fraction is the upper limit on cache size. It 
> defaults to 0.75.
> * In Spark 1.5, spark.storage.memoryFraction is the upper limit in cache 
> size. It defaults to 0.6 and...
> * 

[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory

2016-03-23 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-11293:
---
Affects Version/s: 1.6.1

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0, 1.6.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
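(For illustration, the completion-iterator pattern mentioned above is roughly the following; Spark's own {{CompletionIterator}} is {{private[spark]}}, so this is a generic sketch rather than its actual code:)

{code}
// Generic sketch: run a cleanup action (e.g. sorter.stop()) exactly once,
// after the wrapped iterator has been fully consumed.
class CleanupIterator[A](underlying: Iterator[A], cleanup: () => Unit) extends Iterator[A] {
  private var cleaned = false
  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !cleaned) { cleaned = true; cleanup() }
    more
  }
  override def next(): A = underlying.next()
}
{code}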



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-03 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-13620:
---
Fix Version/s: 2.0.0

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
> Fix For: 2.0.0
>
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-03 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos resolved SPARK-13620.

Resolution: Fixed

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
> Fix For: 2.0.0
>
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177784#comment-15177784
 ] 

Daniel Darabos commented on SPARK-13620:


I just tested with the latest 2.0 nightly. Starting {{spark-shell}} is 
subjectively fast. {{strace}} does not see the {{RESOLVE-ADDRESS}} request. 
Looks like it's fixed then! I should have checked {{master}} before filing an 
issue, sorry.

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
> Fix For: 2.0.0
>
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15177655#comment-15177655
 ] 

Daniel Darabos commented on SPARK-13620:


For the record, I figured out a workaround. {{strace}} shows that Java connects to 
{{/var/run/avahi-daemon/socket}}, sends {{RESOLVE-ADDRESS 0.0.0.0\n}} and times 
out reading. This can be confirmed by {{avahi-resolve-address 0.0.0.0}} which 
also times out after 5 seconds. The workaround is to edit {{/etc/avahi/hosts}} 
and add {{0.0.0.0 any.local}}. Saves me 5 seconds a million times a day! 
(Assuming I start a million Spark applications...)
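
If anyone wants to confirm the culprit on their own machine, here is a quick check (just a sketch, not part of Spark): {{InetAddress.getByName("0.0.0.0")}} parses the literal without any DNS traffic, so the delay comes entirely from the reverse lookup inside {{getHostName}}. Before the {{/etc/avahi/hosts}} edit this should hang for ~5 seconds; after it, it returns almost immediately.

{code}
scala> { val t0 = System.currentTimeMillis; java.net.InetAddress.getByName("0.0.0.0").getHostName; System.currentTimeMillis - t0 }
{code}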

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-02 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175974#comment-15175974
 ] 

Daniel Darabos commented on SPARK-13620:


Looks like the whole {{InetSocketAddress}} line has disappeared since Spark 
1.6.0 due to https://github.com/apache/spark/pull/10238. I guess this is most 
likely fixed then. I'll try building HEAD and see if I can confirm the fix.

> Avoid reverse DNS lookup for 0.0.0.0 on startup
> ---
>
> Key: SPARK-13620
> URL: https://issues.apache.org/jira/browse/SPARK-13620
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Daniel Darabos
>Priority: Minor
>
> I noticed we spend 5+ seconds during application startup with the following 
> stack trace:
> {code}
> at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
> at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
> at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
> at java.net.InetAddress.getHostName(InetAddress.java:553)
> at java.net.InetAddress.getHostName(InetAddress.java:525)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
> at 
> java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
> at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
> at org.spark-project.jetty.server.Server.<init>(Server.java:115)
> at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
> at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
> at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
> at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
> at scala.Option.foreach(Option.scala:236)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
> {code}
> Spark wants to start a server on localhost. So it [creates an 
> {{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
>  [with hostname 
> {{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
>  Spark passes in a hostname string, but Java [recognizes that it's actually 
> an 
> address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
>  and so sets the hostname to {{null}}. So when Jetty [calls 
> {{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
>  Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds 
> on my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.
> There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set 
> a hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
> used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
>  which is the same, but does not need resolving.
> {code}
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
> System.currentTimeMillis - t0 }
> res0: Long = 5432
> scala> { val t0 = System.currentTimeMillis; new 
> java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
> res1: Long = 0
> {code}
> I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-02 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-13620:
---
Description: 
I noticed we spend 5+ seconds during application startup with the following 
stack trace:

{code}
at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
at java.net.InetAddress.getHostName(InetAddress.java:553)
at java.net.InetAddress.getHostName(InetAddress.java:525)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
at org.spark-project.jetty.server.Server.<init>(Server.java:115)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
{code}

Spark wants to start a server on localhost. So it [creates an 
{{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
 [with hostname 
{{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
 Spark passes in a hostname string, but Java [recognizes that it's actually an 
address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
 and so sets the hostname to {{null}}. So when Jetty [calls 
{{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
 Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds on 
my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.

There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set a 
hostname. In this case [{{InetAddress.anyLocalAddress()}} is 
used|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166],
 which is the same, but does not need resolving.

{code}
scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
System.currentTimeMillis - t0 }
res0: Long = 5432

scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
res1: Long = 0
{code}

I'll send a pull request for this.

  was:
I noticed we spend 5+ seconds during application startup with the following 
stack trace:

{code}
at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
at java.net.InetAddress.getHostName(InetAddress.java:553)
at java.net.InetAddress.getHostName(InetAddress.java:525)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
at org.spark-project.jetty.server.Server.<init>(Server.java:115)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at 

[jira] [Updated] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-02 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-13620:
---
Description: 
I noticed we spend 5+ seconds during application startup with the following 
stack trace:

{code}
at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
at java.net.InetAddress.getHostName(InetAddress.java:553)
at java.net.InetAddress.getHostName(InetAddress.java:525)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
at org.spark-project.jetty.server.Server.<init>(Server.java:115)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
{code}

Spark wants to start a server on localhost. So it [creates an 
{{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
 [with hostname 
{{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
 Spark passes in a hostname string, but Java [recognizes that it's actually an 
address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
 and so sets the hostname to {{null}}. So when Jetty [calls 
{{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
 Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds on 
my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.

There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set a 
hostname. In this case 
[{{InetAddress.anyLocalAddress()}}|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166
 is used], which is the same, but does not need resolving.

{code}
scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
System.currentTimeMillis - t0 }
res0: Long = 5432

scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
res1: Long = 0
{code}

I'll send a pull request for this.

  was:
I noticed we spend 5+ seconds during application startup with the following 
stack trace:

{code}
at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
at java.net.InetAddress.getHostName(InetAddress.java:553)
at java.net.InetAddress.getHostName(InetAddress.java:525)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
at org.spark-project.jetty.server.Server.<init>(Server.java:115)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at 

[jira] [Created] (SPARK-13620) Avoid reverse DNS lookup for 0.0.0.0 on startup

2016-03-02 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-13620:
--

 Summary: Avoid reverse DNS lookup for 0.0.0.0 on startup
 Key: SPARK-13620
 URL: https://issues.apache.org/jira/browse/SPARK-13620
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Daniel Darabos
Priority: Minor


I noticed we spend 5+ seconds during application startup with the following 
stack trace:

{code}
at java.net.Inet6AddressImpl.getHostByAddr(Native Method)
at java.net.InetAddress$1.getHostByAddr(InetAddress.java:926)
at java.net.InetAddress.getHostFromNameService(InetAddress.java:611)
at java.net.InetAddress.getHostName(InetAddress.java:553)
at java.net.InetAddress.getHostName(InetAddress.java:525)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.getHostName(InetSocketAddress.java:82)
at 
java.net.InetSocketAddress$InetSocketAddressHolder.access$600(InetSocketAddress.java:56)
at java.net.InetSocketAddress.getHostName(InetSocketAddress.java:345)
at org.spark-project.jetty.server.Server.<init>(Server.java:115)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:243)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1964)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1955)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
{code}

Spark wants to start a server on localhost. So it [creates an 
{{InetSocketAddress}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/JettyUtils.scala#L243]
 [with hostname 
{{"0.0.0.0"}}|https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/ui/WebUI.scala#L136].
 Spark passes in a hostname string, but Java [recognizes that it's actually an 
address|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L220]
 and so sets the hostname to {{null}}. So when Jetty [calls 
{{getHostName}}|https://github.com/eclipse/jetty.project/blob/jetty-8.1.14.v20131031/jetty-server/src/main/java/org/eclipse/jetty/server/Server.java#L115]
 Java has to do a reverse DNS lookup for {{0.0.0.0}}. That takes 5+ seconds on 
my machine. Maybe it's just me? It's a very vanilla Ubuntu 14.04.

There is a simple fix. Instead of passing in {{"0.0.0.0"}} we should not set a 
hostname. In this case 
[{{InetAddress.anyLocalAddress()}}|https://github.com/openjdk-mirror/jdk/blob/adea42765ae4e7117c3f0e2d618d5e6aed44ced2/src/share/classes/java/net/InetSocketAddress.java#L166]
 is used, which is the same, but does not need resolving.

{code}
scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress("0.0.0.0", 8000).getHostName; 
System.currentTimeMillis - t0 }
res0: Long = 5432

scala> { val t0 = System.currentTimeMillis; new 
java.net.InetSocketAddress(8000).getHostName; System.currentTimeMillis - t0 }
res1: Long = 0
{code}

I'll send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder

2016-03-01 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos reopened SPARK-10057:


Thanks for the repro, Stephen! It also works in Spark 1.6.0.

This was mentioned in https://github.com/apache/spark/pull/8196 but ultimately 
I think the conclusion there was that it was not due to that change.

> Faill to load class org.slf4j.impl.StaticLoggerBinder
> -
>
> Key: SPARK-10057
> URL: https://issues.apache.org/jira/browse/SPARK-10057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Davies Liu
>
> Some log messages are dropped because the class 
> "org.slf4j.impl.StaticLoggerBinder" cannot be loaded:
> {code}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder

2016-03-01 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-10057:
---
Affects Version/s: 1.6.0

> Faill to load class org.slf4j.impl.StaticLoggerBinder
> -
>
> Key: SPARK-10057
> URL: https://issues.apache.org/jira/browse/SPARK-10057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Davies Liu
>
> Some log messages are dropped because the class 
> "org.slf4j.impl.StaticLoggerBinder" cannot be loaded:
> {code}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13212) Provide a way to unregister data sources from a SQLContext

2016-02-05 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-13212:
--

 Summary: Provide a way to unregister data sources from a SQLContext
 Key: SPARK-13212
 URL: https://issues.apache.org/jira/browse/SPARK-13212
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Daniel Darabos


We allow our users to run SQL queries on their data via a web interface. We 
create an isolated SQLContext with {{sqlContext.newSession()}}, create their 
DataFrames in this context, register them with {{registerTempTable}}, then 
execute the query with {{isolatedContext.sql(query)}}.

The issue is that they have the full power of Spark SQL at their disposal. They 
can run {{SELECT * FROM csv.`/etc/passwd`}}. This specific syntax can be 
disabled by setting {{spark.sql.runSQLOnFiles}} (a private, undocumented 
configuration) to {{false}}. But creating a temporary table 
(http://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically)
 would still work, if we had a HiveContext.
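
For reference, this is roughly the setup described above, as a minimal sketch (Spark 1.6 APIs; the path, table name and query are made up):

{code}
// Isolated session so temp tables don't leak between users.
val isolatedContext = sqlContext.newSession()
// Best-effort mitigation: disable the "SELECT * FROM csv.`...`" syntax.
isolatedContext.setConf("spark.sql.runSQLOnFiles", "false")
// Only what we register here should be visible to the user's query.
val allowed = isolatedContext.read.parquet("/data/allowed/table")
allowed.registerTempTable("user_table")
val result = isolatedContext.sql("SELECT count(*) FROM user_table")
{code}

As noted above, this still leaves every DataSource on the classpath reachable, which is exactly the gap this ticket is about.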

As long as all DataSources on the classpath are readily available, I don't 
think we can be reassured about the security implications. So I think a nice 
solution would be to make the list of available DataSources a property of the 
SQLContext. Then for the isolated SQLContext we could simply remove all 
DataSources. This would allow more fine-grained use cases too.

What do you think?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2016-02-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130351#comment-15130351
 ] 

Daniel Darabos commented on SPARK-1239:
---

I've read an interesting article about the "Kylix" butterfly allreduce 
(http://www.cs.berkeley.edu/~jfc/papers/14/Kylix.pdf). I think this is a direct 
solution to this problem and the authors say integration with Spark should be 
"easy".

Perhaps the same approach could be simulated within the current Spark shuffle 
implementation. I think the idea is to break up the M*R shuffle into an M*K and 
a K*R shuffle, where K is much less than M or R. So those K partitions will be 
large, but that should be fine.
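
To make that concrete, here is a rough sketch (not Kylix itself, just an approximation with the existing RDD API) of routing an aggregation through K coarse partitions before the final R partitions; all the numbers are made up:

{code}
import org.apache.spark.HashPartitioner

val M = 10000  // map-side partitions
val R = 10000  // final reduce-side partitions
val K = 100    // intermediate fan-in, K << M and K << R

val pairs = sc.parallelize(0 until 1000000, M).map(x => (x % 50000, 1L))
val counts = pairs
  .reduceByKey(new HashPartitioner(K), _ + _)  // M*K shuffle: each map task feeds only K partitions
  .reduceByKey(new HashPartitioner(R), _ + _)  // K*R shuffle: only K tasks feed the R reducers
{code}

Each reducer in the final shuffle then only needs the statuses of K map tasks instead of M.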

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>Assignee: Thomas Graves
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12964) SparkContext.localProperties leaked

2016-01-22 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-12964:
---
Description: 
I have a non-deterministic but quite reliable reproduction for a case where 
{{spark.sql.execution.id}} is leaked. Operations then die with 
{{spark.sql.execution.id is already set}}. These threads never recover because 
there is nothing to unset {{spark.sql.execution.id}}. (It's not a case of 
nested {{withNewExecutionId}} calls.)

I have figured out why this happens. We are within a {{withNewExecutionId}} 
block. At some point we call back to user code. (In our case this is a custom 
data source's {{buildScan}} method.) The user code calls 
{{scala.concurrent.Await.result}}. Because our thread is a member of a 
{{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts a 
new thread. {{SparkContext.localProperties}} is cloned for this thread and then 
it's ready to serve an HTTP request.

The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
{{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} 
block. All good. But some time later another HTTP request will be served by the 
second thread. This thread is "poisoned" with a {{spark.sql.execution.id}}. 
When it tries to use {{withNewExecutionId}} it fails.
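
The inheritance mechanism itself is easy to demonstrate without Spark; a minimal sketch (plain Scala, nothing Spark-specific): a value put into an {{InheritableThreadLocal}} is copied into every thread created afterwards, and clearing it in the parent later does not clear the child's copy.

{code}
val prop = new InheritableThreadLocal[String]
prop.set("spark.sql.execution.id=0")            // roughly what withNewExecutionId sets
val child = new Thread(new Runnable {           // the value is copied here, at construction
  def run(): Unit = println("child sees: " + prop.get)
})
prop.remove()                                   // clearing in the parent does not affect the copy
child.start()
child.join()                                    // prints: child sees: spark.sql.execution.id=0
{code}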



I don't know who's at fault here. 

 - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
Execution object and let it wrap the operation? Then you could have two 
executions in parallel on the same thread, and other stuff like that. It would 
be much clearer than storing the execution ID in a kind-of-global variable.
 - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there is 
a good reason, but this is essentially a bug-generator in my view. (It has 
already generated https://issues.apache.org/jira/browse/SPARK-10563.)
 - {{Await.result}} --- I never would have thought it starts threads.
 - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
 - We probably shouldn't call Spark things from HTTP serving threads.

I'm not sure what could be done on the Spark side, but I thought I should 
mention this interesting issue. For supporting evidence here is the stack trace 
when {{localProperties}} is getting cloned. Its contents at that point are:

{noformat}
{spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
spark.rdd.scope={"id":"4","name":"ExecutedCommand"}}
{noformat}

{noformat}
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) 
[na:1.7.0_91]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) 
[na:1.7.0_91]
  at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
[na:1.7.0_91]   
  at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91]   

  at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91]   

  at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91] 

  at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at scala.concurrent.Await$.result(package.scala:107) 
[org.scala-lang.scala-library-2.10.5.jar:na] 
  at 
com.lynxanalytics.biggraph.graph_api.SafeFuture.awaitResult(SafeFuture.scala:50)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.graph_api.DataManager.get(DataManager.scala:315) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.graph_api.Scripting$.getData(Scripting.scala:87) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 [biggraph.jar]
  at 

[jira] [Created] (SPARK-12964) SparkContext.localProperties leaked

2016-01-22 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-12964:
--

 Summary: SparkContext.localProperties leaked
 Key: SPARK-12964
 URL: https://issues.apache.org/jira/browse/SPARK-12964
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.0
Reporter: Daniel Darabos
Priority: Minor


I have a non-deterministic but quite reliable reproduction for a case where 
{{spark.sql.execution.id}} is leaked. Operations then die with 
{{spark.sql.execution.id is already set}}. These threads never recover because 
there is nothing to unset {{spark.sql.execution.id}}. (It's not a case of 
nested {{withNewExecutionId}} calls.)

I have figured out why this happens. We are within a {{withNewExecutionId}} 
block. At some point we call back to user code. (In our case this is a custom 
data source's {{buildScan}} method.) The user code calls 
{{scala.concurrent.Await.result}}. Because our thread is a member of a 
{{ForkJoinPool}} (this is a Play HTTP serving thread) {{Await.result}} starts a 
new thread. {{SparkContext.localProperties}} is cloned for this thread and then 
it's ready to serve an HTTP request.

The first thread then finishes waiting, finishes {{buildScan}}, and leaves 
{{withNewExecutionId}}, clearing {{spark.sql.execution.id}} in the {{finally}} 
block. All good. But some time later another HTTP request will be served by the 
second thread. This thread is "poisoned" with a {{spark.sql.execution.id}}. 
When it tries to use {{withNewExecutionId}} it fails.



I don't know who's at fault here. 

 - I don't like the {{ThreadLocal}} properties anyway. Why not create an 
Execution object and let it wrap the operation? Then you could have two 
executions in parallel on the same thread, and other stuff like that. It would 
be much clearer than storing the execution ID in a kind-of-global variable.
 - Why do we have to inherit the {{ThreadLocal}} properties? I'm sure there is 
a good reason, but this is essentially a bug-generator in my view. (It has 
already generated https://issues.apache.org/jira/browse/SPARK-10563.)
 - {{Await.result}} --- I never would have thought it starts threads.
 - We probably shouldn't be calling {{Await.result}} inside {{buildScan}}.
 - We probably shouldn't call Spark things from HTTP serving threads.

I'm not sure what could be done on the Spark side, but I thought I should 
mention this interesting issue. For supporting evidence here is the stack trace 
when {{localProperties}} is getting cloned. Its contents at that point are: {{ 
spark.sql.execution.id=0, spark.rdd.scope.noOverride=true, 
spark.rdd.scope={"id":"4","name":"ExecutedCommand"} }}

{noformat}
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:364) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at org.apache.spark.SparkContext$$anon$2.childValue(SparkContext.scala:362) 
[spark-assembly-1.6.0-hadoop2.4.0.jar:1.6.0]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:353) 
[na:1.7.0_91]
  at java.lang.ThreadLocal$ThreadLocalMap.<init>(ThreadLocal.java:261) 
[na:1.7.0_91]
  at java.lang.ThreadLocal.createInheritedMap(ThreadLocal.java:236) 
[na:1.7.0_91]   
  at java.lang.Thread.init(Thread.java:416) [na:1.7.0_91]   

  at java.lang.Thread.init(Thread.java:349) [na:1.7.0_91]   

  at java.lang.Thread.<init>(Thread.java:508) [na:1.7.0_91] 

  at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.<init>(ForkJoinWorkerThread.java:48)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.<init>(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory.newThread(ExecutionContextImpl.scala:42)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.tryCompensate(ForkJoinPool.java:2341) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3638) 
[org.scala-lang.scala-library-2.10.5.jar:na]
  at 
scala.concurrent.impl.ExecutionContextImpl$DefaultThreadFactory$$anon$2.blockOn(ExecutionContextImpl.scala:45)
 [org.scala-lang.scala-library-2.10.5.jar:na]
  at scala.concurrent.Await$.result(package.scala:107) 
[org.scala-lang.scala-library-2.10.5.jar:na] 
  at 
com.lynxanalytics.biggraph.graph_api.SafeFuture.awaitResult(SafeFuture.scala:50)
 [biggraph.jar]
  at 
com.lynxanalytics.biggraph.graph_api.DataManager.get(DataManager.scala:315) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.graph_api.Scripting$.getData(Scripting.scala:87) 
[biggraph.jar] 
  at 
com.lynxanalytics.biggraph.table.TableRelation$$anonfun$1.apply(DefaultSource.scala:46)
 [biggraph.jar]
  at 

[jira] [Commented] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2016-01-21 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110628#comment-15110628
 ] 

Daniel Darabos commented on SPARK-2309:
---

https://github.com/apache/spark/blob/v1.6.0/docs/ml-classification-regression.md#logistic-regression
 still says:

> The current implementation of logistic regression in spark.ml only supports 
> binary classes. Support for multiclass regression will be added in the future.

That can be removed now, right?

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 1.3.0
>
>
> Currently, there is no multi-class classifier in mllib. Logistic regression 
> can be extended to multinomial one straightforwardly. 
> The following formula will be implemented. 
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-14 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097987#comment-15097987
 ] 

Daniel Darabos commented on SPARK-11293:


> so should be reopened or not? is there still a memory leak? is there a new 
> memory leak instead of the old one?

I think it should be reopened, because the remaining leak is an edge case of 
the original problem that was not covered by Josh's fix. I'll reopen it and 
we'll see what he thinks!

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-14 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos reopened SPARK-11293:


> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-14 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-11293:
---
Affects Version/s: 1.6.0

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-14 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098011#comment-15098011
 ] 

Daniel Darabos commented on SPARK-11293:


> so add 1.6.0 as affected version...

Done.

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-05 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083311#comment-15083311
 ] 

Daniel Darabos commented on SPARK-11293:


I have a somewhat contrived example that still leaks in 1.6.0. I started 
{{spark-shell --master 'local-cluster[2,2,1024]'}} and ran:

{code}
sc.parallelize(0 to 1000, 2).map(x => x % 1 -> 
x).groupByKey.asInstanceOf[org.apache.spark.rdd.ShuffledRDD[Int, Int, 
Iterable[Int]]].setKeyOrdering(implicitly[Ordering[Int]]).mapPartitions { it => 
it.take(1) }.collect
{code}

I've added extra logging around task memory acquisition so I would be able to 
see what is not released. These are the logs:

{code}
16/01/05 17:02:45 INFO Executor: Running task 0.0 in stage 13.0 (TID 24)
16/01/05 17:02:45 INFO MapOutputTrackerWorker: Updating epoch to 7 and clearing 
cache
16/01/05 17:02:45 INFO TorrentBroadcast: Started reading broadcast variable 13
16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes 
in memory (estimated size 2.3 KB, free 7.6 KB)
16/01/05 17:02:45 INFO TorrentBroadcast: Reading broadcast variable 13 took 6 ms
16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13 stored as values in 
memory (estimated size 4.5 KB, free 12.1 KB)
16/01/05 17:02:45 INFO MapOutputTrackerWorker: Don't have map outputs for 
shuffle 6, fetching them
16/01/05 17:02:45 INFO MapOutputTrackerWorker: Doing the fetch; tracker 
endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.0.32:55147)
16/01/05 17:02:45 INFO MapOutputTrackerWorker: Got the output locations
16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks 
out of 2 blocks
16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 
1 ms
16/01/05 17:02:45 ERROR TaskMemoryManager: Task 24 acquire 5.0 MB for null
16/01/05 17:02:45 ERROR TaskMemoryManager: Stack trace:
java.lang.Exception: here
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187)
at 
org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/01/05 17:02:47 ERROR TaskMemoryManager: Task 24 acquire 15.0 MB for null
16/01/05 17:02:47 ERROR TaskMemoryManager: Stack trace:
java.lang.Exception: here
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187)
at 
org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 

[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-01-05 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083326#comment-15083326
 ] 

Daniel Darabos commented on SPARK-11293:


Sorry, my example was overly complicated. This one triggers the same leak.

{code}
sc.parallelize(0 to 1000, 2).map(x => x % 1 -> 
x).groupByKey.mapPartitions { it => it.take(1) }.collect
{code}
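
For what it's worth, the general pattern for tying cleanup to task completion instead of iterator exhaustion looks like this; just a sketch with a stand-in "resource", not the actual fix inside the shuffle reader:

{code}
import org.apache.spark.TaskContext

sc.parallelize(0 to 1000, 2).mapPartitions { it =>
  val buffer = scala.collection.mutable.ArrayBuffer[Int]()  // stands in for a spillable collection
  TaskContext.get().addTaskCompletionListener { _ =>
    buffer.clear()  // runs even if the downstream take(1) never exhausts the iterator
  }
  it.map { x => buffer += x; x }.take(1)
}.collect()
{code}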

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-13 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15003982#comment-15003982
 ] 

Daniel Darabos commented on SPARK-11652:


> I may be missing some point, but Spark isn't consuming serialized data from 
> untrusted sources in general, right? The risk here is way down the list of 
> risks if untrusted sources are sending closures to your cluster.

I was thinking of seemingly harmless things like executor heartbeats and the 
like. I've never looked at how authentication is implemented. Are all 
communications on all ports authenticated? If so, then I don't see any way 
to exploit this either. Sorry for the noise then.

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-11 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-11652:
--

 Summary: Remote code execution with InvokerTransformer
 Key: SPARK-11652
 URL: https://issues.apache.org/jira/browse/SPARK-11652
 Project: Spark
  Issue Type: Bug
Reporter: Daniel Darabos
Priority: Minor


There is a remote code execution vulnerability in the Apache Commons 
collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
that can be exploited simply by causing malicious data to be deserialized using 
Java serialization.

As Spark is used in security-conscious environments I think it's worth taking a 
closer look at how the vulnerability affects Spark. What are the points where 
Spark deserializes external data? Which are affected by using Kryo instead of 
Java serialization? What mitigation strategies are available?

If the issue is serious enough but mitigation is possible, it may be useful to 
post about it on the mailing list or blog.

Thanks!
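
For what it's worth, the Kryo-based mitigation discussed above amounts to configuration along these lines (a sketch using standard Spark settings; switching serializers narrows the attack surface but is not a complete fix on its own):

{code}
import org.apache.spark.SparkConf

// Sketch: use Kryo instead of Java serialization and require explicit registration,
// so unexpected classes (e.g. an InvokerTransformer gadget chain) are rejected
// rather than silently deserialized.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .set("spark.kryo.classesToRegister", "com.example.MyRecord") // hypothetical application class
{code}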



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-11-03 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987046#comment-14987046
 ] 

Daniel Darabos commented on SPARK-1239:
---

I can also add some data. I have a ShuffleMapStage with 82,714 tasks and then a 
ResultStage with 222,609 tasks. The driver cannot serialize this:

{noformat}
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271) ~[na:1.7.0_79]
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) 
~[na:1.7.0_79]
at 
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) 
~[na:1.7.0_79]
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) 
~[na:1.7.0_79]
at 
java.util.zip.DeflaterOutputStream.deflate(DeflaterOutputStream.java:253) 
~[na:1.7.0_79]
at 
java.util.zip.DeflaterOutputStream.write(DeflaterOutputStream.java:211) 
~[na:1.7.0_79]
at java.util.zip.GZIPOutputStream.write(GZIPOutputStream.java:146) 
~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.writeBlockHeader(ObjectOutputStream.java:1893)
 ~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1874)
 ~[na:1.7.0_79]
at 
java.io.ObjectOutputStream$BlockDataOutputStream.flush(ObjectOutputStream.java:1821)
 ~[na:1.7.0_79]
at java.io.ObjectOutputStream.flush(ObjectOutputStream.java:718) 
~[na:1.7.0_79]
at java.io.ObjectOutputStream.close(ObjectOutputStream.java:739) 
~[na:1.7.0_79]
at 
org.apache.spark.MapOutputTracker$$anonfun$serializeMapStatuses$2.apply$mcV$sp(MapOutputTracker.scala:362)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1294) 
~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTracker$.serializeMapStatuses(MapOutputTracker.scala:361)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTrackerMaster.getSerializedMapOutputStatuses(MapOutputTracker.scala:312)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
at 
org.apache.spark.MapOutputTrackerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(MapOutputTracker.scala:49)
 ~[spark-assembly-1.4.0-hadoop2.4.0.jar:1.4.0]
{noformat}

I see {{getSerializedMapOutputStatuses}} has changed a lot since 1.4.0 but it 
still returns an array sized proportional to _M * R_. How can this be part of a 
scalable system? How is this not a major issue for everyone? Am I doing 
something wrong?

I'm now thinking that maybe if you have an overwhelming majority of empty or 
non-empty blocks, the bitmap will compress very well. But it's possible that I 
am ending up with a relatively even mix of empty and non-empty blocks, killing 
the compression. I have about 40 billion lines, _M * R_ is about 20 billion, so 
this seems plausible.

It's also possible that I should have larger partitions. Due to the processing 
I do it's not possible -- it leads to the executors OOMing. But larger 
partitions would not be a scalable solution anyway. If _M_ and _R_ are 
reasonable now with some number of lines per partition, then when your data 
size doubles they will also double and _M * R_ will quadruple. At some point 
the number of lines per map output will be low enough that compression becomes 
ineffective.

I see https://issues.apache.org/jira/browse/SPARK-11271 has recently decreased 
the map status size by 20%. That means in Spark 1.6 I will be able to process 
1/sqrt(0.8) or 12% more data than now. The way I understand the situation the 
improvement required is orders of magnitude larger than that. I'm currently 
hitting this issue with 5 TB of input. If I tried processing 5 PB, the map 
status would be a million times larger.
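
To spell out the arithmetic behind that claim (back-of-the-envelope only):

{code}
// The map status size grows like M * R. With a fixed number of lines per partition,
// both M and R grow linearly with the input, so the status grows quadratically.
val statusShrink = 0.8                             // SPARK-11271: ~20% smaller statuses
val extraData = 1 / math.sqrt(statusShrink) - 1    // ~0.118, i.e. ~12% more data fits
val dataGrowth = 5e15 / 5e12                       // 5 TB -> 5 PB is a 1000x increase
val statusGrowth = dataGrowth * dataGrowth         // so the status grows ~1,000,000x
{code}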

I like the premise of this JIRA ticket of not building the map status table in 
the first place. But a colleague of mine asks if perhaps we could even avoid 
tracking this data in the driver. If the driver just provided the reducers with 
the list of mappers they could each just ask the mappers directly for the list 
of blocks they should fetch.

> Don't fetch all map output statuses at each reducer during shuffles
> ---
>
> Key: SPARK-1239
> URL: https://issues.apache.org/jira/browse/SPARK-1239
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Patrick Wendell
>
> Instead we should modify the way we fetch map output statuses to take both a 
> mapper and a reducer - or we should just piggyback the statuses on each task. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Created] (SPARK-11403) Log something when dying from OnOutOfMemoryError

2015-10-29 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-11403:
--

 Summary: Log something when dying from OnOutOfMemoryError
 Key: SPARK-11403
 URL: https://issues.apache.org/jira/browse/SPARK-11403
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.5.1
Reporter: Daniel Darabos
Priority: Trivial


Executors on YARN run with the {{-XX:OnOutOfMemoryError='kill %p'}} flag. The 
motivation, I think, is to avoid ending up in an unpredictable state where some 
threads may have been lost.

The problem is that when this happens, nothing is logged. The executor does not 
log anything; it just exits with exit code 143, which only shows up in the 
NodeManager log.

I'd like to add a tiny log message to make debugging such issues a bit easier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10792) Spark + YARN – executor is not re-created

2015-10-06 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos closed SPARK-10792.
--
Resolution: Duplicate

> Spark + YARN – executor is not re-created
> -
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Critical
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues in 
> a reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: )
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container_1442827158253_0004_01_12 for on host ip-10-0-1-16.ec2.internal
> 2015-09-21 10:33:25,664 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching ExecutorRunnable. 
> 

[jira] [Comment Edited] (SPARK-10792) Spark + YARN – executor is not re-created

2015-10-05 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943405#comment-14943405
 ] 

Daniel Darabos edited comment on SPARK-10792 at 10/5/15 2:17 PM:
-

I work with Andras and can now substantiate his account with logs! We run on 
our own hardware on YARN. We start with 5 executors and then lose two.

Application master logs:

{noformat}
15/10/02 13:01:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) .
15/10/02 22:14:51 INFO yarn.YarnAllocator: Driver requested a total number of 4 
executor(s).
15/10/02 22:14:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 7.
15/10/02 23:17:51 INFO yarn.YarnAllocator: Driver requested a total number of 3 
executor(s).
15/10/02 23:17:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 3.
{noformat}

Driver logs:

{noformat}
W2015-10-02 22:14:51,159 
HeartbeatReceiver:[sparkDriver-akka.actor.default-dispatcher-19] Removing 
executor 7 with no recent heartbeats: 178981 ms exceeds timeout 120000 ms
E2015-10-02 22:14:51,159 
YarnScheduler:[sparkDriver-akka.actor.default-dispatcher-19] Lost executor 7 on 
lynx1: Executor heartbeat timed out after 178981 ms
I2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Re-queueing tasks 
for 7 from TaskSet 7378.0
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 6.0 in 
stage 7378.0 (TID 14240, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 4.0 in 
stage 7378.0 (TID 14236, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 13.0 in 
stage 7378.0 (TID 14245, lynx1): ExecutorLostFailure (executor 7 lost)
I2015-10-02 22:14:51,161 DAGScheduler:[dag-scheduler-event-loop] Executor lost: 
7 (epoch 81)
[...]
I2015-10-02 22:14:51,162 YarnClientSchedulerBackend:[kill-executor-thread] 
Requesting to kill executor(s) 7
I2015-10-02 22:14:51,162 
BlockManagerMasterEndpoint:[sparkDriver-akka.actor.default-dispatcher-5] Trying 
to remove executor 7 from BlockManagerMaster.
I2015-10-02 22:14:51,162 
BlockManagerMasterEndpoint:[sparkDriver-akka.actor.default-dispatcher-5] 
Removing block manager BlockManagerId(7, lynx1, 60463)
I2015-10-02 22:14:51,162 BlockManagerMaster:[dag-scheduler-event-loop] Removed 
7 successfully in removeExecutor
{noformat}

I don't think this is a "minor" bug. It's a very serious issue, basically making 
long-lived Spark applications on YARN unfeasible. What is the workaround? 
Should I be watching the list of executors and restarting the application if an 
executor is permanently lost? Or should I just make sure executors never die?


was (Author: darabos):
I work with Andras and can now substantiate his account with logs! We run on 
our own hardware on YARN. We start with 5 executors and then lose two.

Application master logs:

{verbatim}
15/10/02 13:01:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) .
15/10/02 22:14:51 INFO yarn.YarnAllocator: Driver requested a total number of 4 
executor(s).
15/10/02 22:14:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 7.
15/10/02 23:17:51 INFO yarn.YarnAllocator: Driver requested a total number of 3 
executor(s).
15/10/02 23:17:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 3.
{verbatim}

Driver logs:

{verbatim}
W2015-10-02 22:14:51,159 
HeartbeatReceiver:[sparkDriver-akka.actor.default-dispatcher-19] Removing 
executor 7 with no recent heartbeats: 178981 ms exceeds timeout 120000 ms
E2015-10-02 22:14:51,159 
YarnScheduler:[sparkDriver-akka.actor.default-dispatcher-19] Lost executor 7 on 
lynx1: Executor heartbeat timed out after 178981 ms
I2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Re-queueing tasks 
for 7 from TaskSet 7378.0
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 6.0 in 
stage 7378.0 (TID 14240, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 4.0 in 
stage 7378.0 (TID 14236, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 13.0 in 
stage 7378.0 (TID 14245, lynx1): ExecutorLostFailure (executor 7 lost)
I2015-10-02 22:14:51,161 DAGScheduler:[dag-scheduler-event-loop] Executor lost: 
7 (epoch 81)
[...]
I2015-10-02 22:14:51,162 YarnClientSchedulerBackend:[kill-executor-thread] 
Requesting to kill executor(s) 7
I2015-10-02 22:14:51,162 
BlockManagerMasterEndpoint:[sparkDriver-akka.actor.default-dispatcher-5] 

[jira] [Updated] (SPARK-10792) Spark + YARN – executor is not re-created

2015-10-05 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-10792:
---
   Priority: Critical  (was: Minor)
Component/s: (was: Streaming)
Summary: Spark + YARN – executor is not re-created  (was: Spark 
streaming + YARN – executor is not re-created on machine restart)

> Spark + YARN – executor is not re-created
> -
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Critical
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues in 
> a reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: )
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> 

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-10-05 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943405#comment-14943405
 ] 

Daniel Darabos commented on SPARK-10792:


I work with Andras and can now substantiate his account with logs! We run on 
our own hardware on YARN. We start with 5 executors and then lose two.

Application master logs:

{verbatim}
15/10/02 13:01:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) .
15/10/02 22:14:51 INFO yarn.YarnAllocator: Driver requested a total number of 4 
executor(s).
15/10/02 22:14:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 7.
15/10/02 23:17:51 INFO yarn.YarnAllocator: Driver requested a total number of 3 
executor(s).
15/10/02 23:17:51 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to 
kill executor(s) 3.
{verbatim}

Driver logs:

{verbatim}
W2015-10-02 22:14:51,159 
HeartbeatReceiver:[sparkDriver-akka.actor.default-dispatcher-19] Removing 
executor 7 with no recent heartbeats: 178981 ms exceeds timeout 120000 ms
E2015-10-02 22:14:51,159 
YarnScheduler:[sparkDriver-akka.actor.default-dispatcher-19] Lost executor 7 on 
lynx1: Executor heartbeat timed out after 178981 ms
I2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Re-queueing tasks 
for 7 from TaskSet 7378.0
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 6.0 in 
stage 7378.0 (TID 14240, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 4.0 in 
stage 7378.0 (TID 14236, lynx1): ExecutorLostFailure (executor 7 lost)
W2015-10-02 22:14:51,161 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-19] Lost task 13.0 in 
stage 7378.0 (TID 14245, lynx1): ExecutorLostFailure (executor 7 lost)
I2015-10-02 22:14:51,161 DAGScheduler:[dag-scheduler-event-loop] Executor lost: 
7 (epoch 81)
[...]
I2015-10-02 22:14:51,162 YarnClientSchedulerBackend:[kill-executor-thread] 
Requesting to kill executor(s) 7
I2015-10-02 22:14:51,162 
BlockManagerMasterEndpoint:[sparkDriver-akka.actor.default-dispatcher-5] Trying 
to remove executor 7 from BlockManagerMaster.
I2015-10-02 22:14:51,162 
BlockManagerMasterEndpoint:[sparkDriver-akka.actor.default-dispatcher-5] 
Removing block manager BlockManagerId(7, lynx1, 60463)
I2015-10-02 22:14:51,162 BlockManagerMaster:[dag-scheduler-event-loop] Removed 
7 successfully in removeExecutor
{verbatim}

I don't think this is a "minor" bug. It's a very serious issue, basically making 
long-lived Spark applications on YARN unfeasible. What is the workaround? 
Should I be watching the list of executors and restarting the application if an 
executor is permanently lost? Or should I just make sure executors never die?
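
In case it is useful to others, the kind of monitoring I have in mind could be done with a listener; a rough sketch (the threshold-free reaction here is made up, treat it as illustrative only):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorRemoved}

// Rough sketch: count removed executors and shout loudly so an external supervisor
// (or an operator) can decide to restart the application.
class ExecutorLossWatchdog extends SparkListener {
  @volatile private var lost = 0
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit = {
    lost += 1
    System.err.println(
      s"Executor ${removed.executorId} removed (${removed.reason}); $lost lost so far")
  }
}

// sc.addSparkListener(new ExecutorLossWatchdog())
{code}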

> Spark streaming + YARN – executor is not re-created on machine restart
> --
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming, YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Minor
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues in 
> a reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick 

[jira] [Commented] (SPARK-10792) Spark + YARN – executor is not re-created

2015-10-05 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943407#comment-14943407
 ] 

Daniel Darabos commented on SPARK-10792:


I've edited the priority, the title, and the component of the issue. Please 
correct me if you disagree! Thanks!

> Spark + YARN – executor is not re-created
> -
>
> Key: SPARK-10792
> URL: https://issues.apache.org/jira/browse/SPARK-10792
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.4.0
> Environment: - centos7 deployed on AWS
> - yarn / hadoop 2.6.0-cdh5.4.2
> - spark 1.4.0 compiled with hadoop 2.6
>Reporter: Adrian Tanase
>Priority: Critical
> Attachments: Screen Shot 2015-09-21 at 1.58.28 PM.png
>
>
> We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a 
> stateful app that reads from kafka (with the new direct API) and we’re 
> checkpointing to HDFS.
> During some resilience testing, we restarted one of the machines and brought 
> it back online. During the offline period, the Yarn cluster would not have 
> resources to re-create the missing executor.
> After starting all the services on the machine, it correctly joined the Yarn 
> cluster, however the spark streaming app does not seem to notice that the 
> resources are back and has not re-created the missing executor.
> The app is correctly running with 6 out of 7 executors, however it’s running 
> under capacity.
> If we manually kill the driver and re-submit the app to yarn, all the state is 
> correctly recreated from checkpoint and all 7 executors are now online – 
> however this seems like a brutal workaround.
> *Scenarios tested to isolate the issue:*
> The expected outcome after a machine reboot + services back is that 
> processing continues on it. *FAILED* below means that processing continues in 
> a reduced capacity, as the lost machine rarely re-joins as a container/executor 
> even if YARN sees it as a healthy node.
> || No || Failure scenario || test result || data loss || Notes ||
> | 1  | Single node restart | FAILED | NO | Executor NOT redeployed when 
> machine comes back and services are restarted |
> | 2  | Multi-node restart (quick succession) | FAILED | YES | If we are not 
> restoring services on machines that are down, the app OR kafka OR zookeeper 
> metadata gets corrupted, app crashes and can't be restarted w/o clearing 
> checkpoint -> dataloss. Root cause is unhealthy cluster when too many 
> machines are lost. |
> | 3  | Multi-node restart (rolling) | FAILED | NO | Same as single node 
> restart, driver does not crash |
> | 4  | Graceful services restart | FAILED | NO | Behaves just like single 
> node restart even if we take the time to manually stop services before 
> machine reboot. |
> | 5  | Adding nodes to an incomplete cluster | SUCCESS | NO | The spark app 
> will usually start even if YARN can't fulfill all the resource requests 
> (e.g. 5 out of 7 nodes are up when app is started). However, when the nodes 
> are added to YARN, we see that Spark deploys executors on them, as expected 
> in all the scenarios. |
> | 6  | Restart executor process | PARTIAL SUCCESS | NO | 1 out of 5 attempts 
> it behaves like machine restart - the rest work as expected, 
> container/executor are redeployed in a matter of seconds |
> | 7  | Node restart on bigger cluster | FAILED | NO | We were trying to 
> validate if the behavior is caused by maxing out the cluster and having no 
> slack to redeploy a crashed node. We are still behaving like single node 
> restart even with lots of extra capacity in YARN - nodes, cores and RAM. |
> *Logs for Scenario 6 – correct behavior on process restart*
> {noformat}
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Completed container 
> container_1442827158253_0004_01_04 (state: COMPLETE, exit status: 137)
> 2015-09-21 11:00:11,193 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container marked as failed: 
> container_1442827158253_0004_01_04. Exit status: 137. Diagnostics: 
> Container killed on request. Exit code is 137
> Container exited with a non-zero exit code 137
> Killed by external signal
> ..
> (logical continuation from earlier restart attempt)
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Will request 1 executor 
> containers, each with 14 cores and 18022 MB memory including 1638 MB overhead
> 2015-09-21 10:33:20,658 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Container request (host: Any, 
> capability: )
> ..
> 2015-09-21 10:33:25,663 [Reporter] INFO  
> org.apache.spark.deploy.yarn.YarnAllocator - Launching container 
> container_1442827158253_0004_01_12 for on host 

[jira] [Commented] (SPARK-5077) Map output statuses can still exceed spark.akka.frameSize

2015-09-11 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740816#comment-14740816
 ] 

Daniel Darabos commented on SPARK-5077:
---

Hi Josh,
The MapOutputTracker errors are a source of pain for us. Today we hit the 100 
MB frame size with a 30,000-partition stage. A natural solution is to increase 
the frame size setting to 1 GB. But this got us thinking about what problems 
this would cause.

My reading of the code is that it would only affect messages that are larger 
than the frame size. That is, it will not cause smaller messages to suddenly 
start using more memory, for example by allocating a 1 GB buffer for each 
message. It would be reassuring if you could confirm that. This may even be a 
good addition to the documentation. It's not obvious why this setting would not 
be set to infinity, for example. Thanks!
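
Concretely, the change we are considering is just this (the value is in MB for this setting, so 1024 is roughly the 1 GB mentioned above; a sketch, not a recommendation):

{code}
import org.apache.spark.SparkConf

// Sketch of the setting under discussion; spark.akka.frameSize is specified in MB.
val conf = new SparkConf()
  .set("spark.akka.frameSize", "1024")  // up from the ~100 MB limit we are hitting
{code}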

> Map output statuses can still exceed spark.akka.frameSize
> -
>
> Key: SPARK-5077
> URL: https://issues.apache.org/jira/browse/SPARK-5077
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0, 1.3.0, 1.4.1
>Reporter: Josh Rosen
>
> Since HighlyCompressedMapOutputStatuses uses a bitmap for tracking empty 
> blocks, its size is not bounded and thus Spark is still susceptible to 
> "MapOutputTrackerMasterActor: Map output statuses
> were 11141547 bytes which exceeds spark.akka.frameSize"-type errors, even in 
> 1.2.0.
> We needed to use a bitmap for tracking zero-sized blocks (see SPARK-3740; 
> this isn't just a performance issue; it's necessary for correctness).  This 
> will require a bit more effort to fix, since we'll either have to find a way 
> to use a fixed size / capped size encoding for MapOutputStatuses (which might 
> require changes to let us fetch empty blocks safely) or figure out some other 
> strategy for shipping these statuses.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9

2015-08-28 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14718559#comment-14718559
 ] 

Daniel Darabos commented on SPARK-6497:
---

I've checked 1.5.0-RC2 and the repro still works there. In production we've now 
also seen {{scala.reflect.ManifestFactory$$anon$10}} and 
{{scala.reflect.ManifestFactory$$anon$12}}. (On the other hand I'm not entirely 
sure now that we've seen {{scala.reflect.ManifestFactory$$anon$8}} as reported 
originally.)
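
Our current workaround is to register the offending classes ourselves, roughly like this (the class list comes from the exceptions we have seen, so treat it as illustrative rather than exhaustive):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Illustrative workaround: explicitly register the anonymous manifest classes
// that show up in the "Class is not registered" errors below.
class ManifestWorkaroundRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(Class.forName("scala.reflect.ManifestFactory$$anon$9"))
    kryo.register(Class.forName("scala.reflect.ManifestFactory$$anon$10"))
    kryo.register(Class.forName("scala.reflect.ManifestFactory$$anon$12"))
  }
}
// Enabled with: --conf spark.kryo.registrator=my.package.ManifestWorkaroundRegistrator
{code}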

 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 --

 Key: SPARK-6497
 URL: https://issues.apache.org/jira/browse/SPARK-6497
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Daniel Darabos

 This is a slight regression from Spark 1.2.1 to 1.3.0.
 {noformat}
 spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
 {noformat}
 {noformat}
 spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 Lost task 1.0 in stage 3.0 (TID 25, localhost): 
 com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
 Serialization trace:
 evidence$1 (org.apache.spark.util.collection.CompactBuffer)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.IllegalArgumentException: Class is not registered: 
 scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
   at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
   ... 13 more
 {noformat}
 In our production code the exception is actually about 
 {{scala.reflect.ManifestFactory$$anon$8}} instead of 
 {{scala.reflect.ManifestFactory$$anon$9}} but it's probably the same thing. 
 Any idea what changed from 1.2.1 to 1.3.0 that could be causing this?
 We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and 
 {{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction 
 yet. We can of course just register these classes ourselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9

2015-08-28 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-6497:
--
Priority: Major  (was: Minor)

 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 --

 Key: SPARK-6497
 URL: https://issues.apache.org/jira/browse/SPARK-6497
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Daniel Darabos

 This is a slight regression from Spark 1.2.1 to 1.3.0.
 {noformat}
 spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
 {noformat}
 {noformat}
 spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 Lost task 1.0 in stage 3.0 (TID 25, localhost): 
 com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
 Serialization trace:
 evidence$1 (org.apache.spark.util.collection.CompactBuffer)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.IllegalArgumentException: Class is not registered: 
 scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
   at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
   ... 13 more
 {noformat}
 In our production code the exception is actually about 
 {{scala.reflect.ManifestFactory$$anon$8}} instead of 
 {{scala.reflect.ManifestFactory$$anon$9}} but it's probably the same thing. 
 Any idea what changed from 1.2.1 to 1.3.0 that could be causing this?
 We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and 
 {{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction 
 yet. We can of course just register these classes ourselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9

2015-08-28 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-6497:
--
Affects Version/s: 1.5.0
   1.4.0

 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 --

 Key: SPARK-6497
 URL: https://issues.apache.org/jira/browse/SPARK-6497
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0, 1.4.0, 1.5.0
Reporter: Daniel Darabos
Priority: Minor

 This is a slight regression from Spark 1.2.1 to 1.3.0.
 {noformat}
 spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
 {noformat}
 {noformat}
 spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf 
 spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
 spark.kryo.registrationRequired=true --conf 
 'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
 scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
 Lost task 1.0 in stage 3.0 (TID 25, localhost): 
 com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
 Class is not registered: scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
 Serialization trace:
 evidence$1 (org.apache.spark.util.collection.CompactBuffer)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
   at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
   at 
 com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
   at 
 org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.IllegalArgumentException: Class is not registered: 
 scala.reflect.ManifestFactory$$anon$9
 Note: To register this class use: 
 kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
   at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
   at 
 com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
   at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
   at 
 com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
   ... 13 more
 {noformat}
 In our production code the exception is actually about 
 {{scala.reflect.ManifestFactory$$anon$8}} instead of 
 {{scala.reflect.ManifestFactory$$anon$9}} but it's probably the same thing. 
 Any idea what changed from 1.2.1 to 1.3.0 that could be causing this?
 We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and 
 {{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction 
 yet. We can of course just register these classes ourselves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9382) Tachyon version mismatch

2015-07-27 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-9382:
-

 Summary: Tachyon version mismatch
 Key: SPARK-9382
 URL: https://issues.apache.org/jira/browse/SPARK-9382
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos


The spark-ec2 script installs Tachyon 0.5.0 ({{tachyon-0.5.0-bin.tar.gz}}). But 
the Tachyon client that comes with Spark 1.4.0 
({{spark-1.4.0-bin-hadoop1.tgz}}) is version 0.6.4.

The client is unable to connect to the server.

{noformat}
15/07/27 14:11:05 INFO : Tachyon client (version 0.6.4) is trying to connect 
master @ ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998
15/07/27 14:11:05 INFO : User registered at the master 
ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998 got UserId 737
15/07/27 14:11:05 ERROR : Invalid method name: 'user_getUfsAddress'
tachyon.org.apache.thrift.TApplicationException: Invalid method name: 
'user_getUfsAddress'
{noformat}

{{user_getUfsAddress}} was 
[added|https://github.com/amplab/tachyon/commit/c324f970cb08d2d5b49ecd7a66df313d93f1a23c]
 in Tachyon 0.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9382) Tachyon version mismatch

2015-07-27 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14642908#comment-14642908
 ] 

Daniel Darabos commented on SPARK-9382:
---

Ah, looks like this was just an oversight and has been fixed as part of 
SPARK-8322. 
(https://github.com/apache/spark/commit/141eab71ee3aa05da899ecfc6bae40b3798a4665)
 So only 1.4.0 would be affected by the mismatch. Serves me right for not using 
the latest version! Sorry for the noise.

 Tachyon version mismatch
 

 Key: SPARK-9382
 URL: https://issues.apache.org/jira/browse/SPARK-9382
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
 Fix For: 1.4.1


 The spark-ec2 script installs Tachyon 0.5.0 ({{tachyon-0.5.0-bin.tar.gz}}). 
 But the Tachyon client that comes with Spark 1.4.0 
 ({{spark-1.4.0-bin-hadoop1.tgz}}) is version 0.6.4.
 The client is unable to connect to the server.
 {noformat}
 15/07/27 14:11:05 INFO : Tachyon client (version 0.6.4) is trying to connect 
 master @ ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998
 15/07/27 14:11:05 INFO : User registered at the master 
 ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998 got UserId 737
 15/07/27 14:11:05 ERROR : Invalid method name: 'user_getUfsAddress'
 tachyon.org.apache.thrift.TApplicationException: Invalid method name: 
 'user_getUfsAddress'
 {noformat}
 {{user_getUfsAddress}} was 
 [added|https://github.com/amplab/tachyon/commit/c324f970cb08d2d5b49ecd7a66df313d93f1a23c]
  in Tachyon 0.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9382) Tachyon version mismatch

2015-07-27 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos closed SPARK-9382.
-
   Resolution: Fixed
Fix Version/s: 1.4.1

 Tachyon version mismatch
 

 Key: SPARK-9382
 URL: https://issues.apache.org/jira/browse/SPARK-9382
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
 Fix For: 1.4.1


 The spark-ec2 script installs Tachyon 0.5.0 ({{tachyon-0.5.0-bin.tar.gz}}). 
 But the Tachyon client that comes with Spark 1.4.0 
 ({{spark-1.4.0-bin-hadoop1.tgz}}) is version 0.6.4.
 The client is unable to connect to the server.
 {noformat}
 15/07/27 14:11:05 INFO : Tachyon client (version 0.6.4) is trying to connect 
 master @ ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998
 15/07/27 14:11:05 INFO : User registered at the master 
 ec2-54-157-219-241.compute-1.amazonaws.com/10.45.51.200:19998 got UserId 737
 15/07/27 14:11:05 ERROR : Invalid method name: 'user_getUfsAddress'
 tachyon.org.apache.thrift.TApplicationException: Invalid method name: 
 'user_getUfsAddress'
 {noformat}
 {{user_getUfsAddress}} was 
 [added|https://github.com/amplab/tachyon/commit/c324f970cb08d2d5b49ecd7a66df313d93f1a23c]
  in Tachyon 0.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8898) Jets3t hangs with more than 1 core

2015-07-17 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14631153#comment-14631153
 ] 

Daniel Darabos commented on SPARK-8898:
---

Unfortunately the {{spark-ec2}} script does not provide the Hadoop 2.6 build. 
https://github.com/mesos/spark-ec2/blob/branch-1.4/spark/init.sh gives the 
choice of the {{hadoop1}}, {{cdh4}}, and {{hadoop2.4}} builds. I tried all 
three with the same result.
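
For reference, the hang reproduces with a job of roughly this shape (run from spark-shell, so {{sc}} is available; the bucket paths and the input/output formats are placeholders, not the actual code from the report):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

// Read from S3 and write straight back to S3. With more than 1 core per executor,
// every task ends up blocked in the jets3t connection pool (see the stack traces
// in the description below); with 1-core executors it completes.
val rdd = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("s3n://some-bucket/input")
rdd.saveAsNewAPIHadoopFile[TextOutputFormat[LongWritable, Text]]("s3n://some-bucket/output")
{code}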

 Jets3t hangs with more than 1 core
 --

 Key: SPARK-8898
 URL: https://issues.apache.org/jira/browse/SPARK-8898
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: S3
Reporter: Daniel Darabos

 If I have an RDD that reads from S3 ({{newAPIHadoopFile}}), and try to write 
 this to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core 
 per executor.
 It sounds like a race condition, but so far I have seen it trigger 100% of 
  the time. If it were just a race for a limited pool of connections, I would 
  expect at least one task to succeed at least some of the time. But I 
 never saw a single completed task, except when running with 1-core executors.
 All executor threads hang with one of the following two stack traces:
 {noformat:title=Stack trace 1}
java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on 0x0007759cae70 (a 
 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
 at 
 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
 - locked 0x0007759cae70 (a 
 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
 at 
 org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
 at 
 org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
 at 
 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
 at 
 org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
 at 
 org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342)
 at 
 org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718)
 at 
 org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599)
 at 
 org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535)
 at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987)
 at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332)
 at 
 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107)
 at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
 at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
 at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown 
 Source)
 at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
 at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
 at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832)
 at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731)
 at 
 org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030)
 at 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {noformat}
 

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-15 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628380#comment-14628380
 ] 

Daniel Darabos commented on SPARK-4879:
---

Thanks, Josh! I wonder when the post-save-action check you suggest would run. 
Based on the above log snippet I think it's quite likely that at the point when 
the stage finishes the files are all there. I suspect it's the speculative task 
which fails after the stage has finished that deletes the file. It may be hard 
to check at the right point in time.

I'll try to find a smaller reproduction. First it would be great if I could 
reproduce on my local machine instead of starting EC2 clusters. Then I just 
need to dig out the key operations from our -entangled mess of a- _highly 
sophisticated_ codebase.

One more thing that occurs to me is that perhaps there should never be a line 
of code that deletes an output file. I haven't had a chance to dig into the 
code yet, and I'm sure there is a reason for it, but perhaps the same goal 
could be accomplished without deleting output files. What do you think?
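
To make the verification idea concrete, here is a rough sketch (a hypothetical helper of mine, not anything that exists in Spark) of a post-save check that could run on the driver after the save returns; it assumes the default part-r-NNNNN naming of the new Hadoop API output format:

{code}
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

// Hypothetical post-save verification: fail loudly if any expected part file
// is missing instead of silently losing data. "part-r-%05d" is an assumption
// about the output format's file naming.
def verifyOutput(sc: SparkContext, outputDir: String, numPartitions: Int): Unit = {
  val fs = FileSystem.get(new URI(outputDir), sc.hadoopConfiguration)
  val missing = (0 until numPartitions).filterNot { i =>
    fs.exists(new Path(outputDir, f"part-r-$i%05d"))
  }
  if (missing.nonEmpty) {
    throw new java.io.IOException(
      s"Missing output partitions after save: ${missing.mkString(", ")}")
  }
}
{code}

Of course this only catches the problem after the fact; it does not explain which code path deleted the file.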

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-15 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628086#comment-14628086
 ] 

Daniel Darabos commented on SPARK-4879:
---

I've managed to reproduce on Spark 1.3.1 too (pre-built for Hadoop 1). I ran on 
EC2 with the {{spark-ec2}} script and used the ephemeral HDFS. I used a 
5-machine cluster and repeatedly ran a complex test suite for about 30 minutes 
until the error was triggered. Here are the relevant logs:

{noformat}
I2015-07-15 13:45:19,954 TaskSetManager:[task-result-getter-2] Finished task 
198.0 in stage 320.0 (TID 13290) in 568 ms on ip-10-153-188-224.ec2.internal 
(195/200)
I2015-07-15 13:45:21,174 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Marking task 197 
in stage 320.0 (on ip-10-231-214-6.ec2.internal) as speculatable because it ran 
more than 1240 ms
I2015-07-15 13:45:21,174 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Marking task 176 
in stage 320.0 (on ip-10-231-214-6.ec2.internal) as speculatable because it ran 
more than 1240 ms
I2015-07-15 13:45:21,174 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Marking task 194 
in stage 320.0 (on ip-10-231-214-6.ec2.internal) as speculatable because it ran 
more than 1240 ms
I2015-07-15 13:45:21,174 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Marking task 196 
in stage 320.0 (on ip-10-231-214-6.ec2.internal) as speculatable because it ran 
more than 1240 ms
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Marking task 195 
in stage 320.0 (on ip-10-231-214-6.ec2.internal) as speculatable because it ran 
more than 1240 ms
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Starting task 
197.1 in stage 320.0 (TID 13292, ip-10-153-188-224.ec2.internal, PROCESS_LOCAL, 
1612 bytes)
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Starting task 
176.1 in stage 320.0 (TID 13293, ip-10-63-27-248.ec2.internal, PROCESS_LOCAL, 
1612 bytes)
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Starting task 
194.1 in stage 320.0 (TID 13294, ip-10-154-1-239.ec2.internal, PROCESS_LOCAL, 
1612 bytes)
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Starting task 
195.1 in stage 320.0 (TID 13295, ip-10-228-67-34.ec2.internal, PROCESS_LOCAL, 
1612 bytes)
I2015-07-15 13:45:21,175 
TaskSetManager:[sparkDriver-akka.actor.default-dispatcher-2] Starting task 
196.1 in stage 320.0 (TID 13296, ip-10-153-188-224.ec2.internal, PROCESS_LOCAL, 
1612 bytes)
I2015-07-15 13:45:21,402 TaskSetManager:[task-result-getter-3] Finished task 
176.0 in stage 320.0 (TID 13268) in 2223 ms on ip-10-231-214-6.ec2.internal 
(196/200)
I2015-07-15 13:45:21,445 TaskSetManager:[task-result-getter-0] Finished task 
195.0 in stage 320.0 (TID 13287) in 2113 ms on ip-10-231-214-6.ec2.internal 
(197/200)
I2015-07-15 13:45:21,461 TaskSetManager:[task-result-getter-1] Finished task 
196.1 in stage 320.0 (TID 13296) in 285 ms on ip-10-153-188-224.ec2.internal 
(198/200)
I2015-07-15 13:45:21,464 TaskSetManager:[task-result-getter-2] Finished task 
194.1 in stage 320.0 (TID 13294) in 287 ms on ip-10-154-1-239.ec2.internal 
(199/200)
I2015-07-15 13:45:21,465 TaskSetManager:[task-result-getter-3] Ignoring 
task-finished event for 176.1 in stage 320.0 because task 176 has already 
completed successfully
I2015-07-15 13:45:21,468 TaskSetManager:[task-result-getter-0] Finished task 
197.1 in stage 320.0 (TID 13292) in 292 ms on ip-10-153-188-224.ec2.internal 
(200/200)
I2015-07-15 13:45:21,468 DAGScheduler:[dag-scheduler-event-loop] Stage 320 
(saveAsNewAPIHadoopFile at HadoopFile.scala:208) finished in 4.802 s
I2015-07-15 13:45:21,468 DAGScheduler:[DataManager-5] Job 46 finished: 
saveAsNewAPIHadoopFile at HadoopFile.scala:208, took 4.836626 s
W2015-07-15 13:45:21,478 TaskSetManager:[task-result-getter-1] Lost task 195.1 
in stage 320.0 (TID 13295, ip-10-228-67-34.ec2.internal): java.io.IOException: 
Failed to save output of task: attempt_201507151345_0628_r_000195_1
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:203)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:214)
at 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitTask(FileOutputCommitter.java:167)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:1009)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:979)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 

[jira] [Commented] (SPARK-8836) Sorted join

2015-07-15 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14628030#comment-14628030
 ] 

Daniel Darabos commented on SPARK-8836:
---

I just noticed this has actually been implemented already in SPARK-2213! Cool! 
Too bad it was done as an optimization for SparkSQL and not in Spark Core.

 Sorted join
 ---

 Key: SPARK-8836
 URL: https://issues.apache.org/jira/browse/SPARK-8836
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Minor

 In my [Spark Summit 2015 
 presentation|https://spark-summit.org/2015/events/interactive-graph-analytics-with-spark/]
  I touted sorted joins. It would be a shame to talk about how great they are 
 and then not try to introduce them into Spark.
 When joining co-partitioned RDDs, the current Spark implementation builds a 
 map of the contents of one partition and looks up the items from the other 
 partition. 
 (https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
  using AppendOnlyMap.)
 Another option for lining up the keys from the two partitions is to sort them 
 both and then merge. Just doing this may already be a performance improvement.
 But what we do is we sort the partitions up-front, and then enjoy the 
 benefits over many operations. Our joins are 10x faster than normal Spark 
 joins and don't trigger GC. The hash-based join builds a large hashmap (the 
 size of the partition) while the sorted join does not allocate any memory. 
 The sorted partitions also benefit other operations, such as distinct, where 
 we also avoid building a hashmap. (I think the logic is similar to sort-based 
 shuffle, just at a later stage of the process.)
 Our implementation is based on zipPartitions, and this is entirely workable. 
 We have a custom RDD subclass (SortedRDD) and it overrides a bunch of 
 methods. We have an implicit class that adds a toSortedRDD method on 
 pair-RDDs.
 But I think integrating this into Spark could take it a step further. What we 
 have not investigated is cases where the sorting could be skipped. For 
 example when an RDD came out of a sort-based shuffle, its partitions will be 
 sorted, right? So even if the user never asks for the partitions to be 
 sorted, they can become so, and the faster sorted implementations of join, 
 distinct, etc could kick in automatically. This would speed up applications 
 without any change in their code.
 Instead of a subclass it would probably be best to do this with a simple 
 hasSortedPartitions variable in the RDD. Then perhaps operations could have 
 a preservesPartitionOrder parameter, like it is done with partitioner and 
 preservesPartitioning now. (For example filter(), mapValues(), join(), and 
 distinct() all keep the partition sorted.)
 What do you think about all this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-12 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14624000#comment-14624000
 ] 

Daniel Darabos commented on SPARK-4879:
---

Good idea! I'll try with 1.3.1 next week.

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 One interesting thing to note about this stack trace: if we look at 
 {{FileOutputCommitter.java:160}} 
 ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
  this point in the execution seems to correspond to a case where a task 
 completes, attempts to commit its output, fails for some reason, then deletes 
 the destination file, tries again, and fails:
 {code}
  if (fs.isFile(taskOutput)) {
 152  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 
 153  getTempTaskOutputPath(context));
 154  if (!fs.rename(taskOutput, 

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-10 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466
 ] 

Daniel Darabos commented on SPARK-4879:
---

I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 One interesting thing to note about this stack trace: if we look at 
 

[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-10 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1466#comment-1466
 ] 

Daniel Darabos edited comment on SPARK-4879 at 7/10/15 12:25 PM:
-

I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add a verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.


was (Author: darabos):
I wonder if this issue is serious enough to note in the documentation. What do 
you think about adding a big fat warning for speculative execution until it is 
fixed? Enabling speculative execution may lead to missing output files? Or 
perhaps add verification pass that checks if all the outputs are present and 
raises an exception if not.

Silently dropping output files is a horrible bug. We've been debugging a 
somewhat mythological data corruption issue for about a month, and now we 
realize that this issue (SPARK-4879) is a very plausible explanation. We have 
never been able to reproduce it, but we have a log file, and it shows a 
speculative task for a {{saveAsNewAPIHadoopFile}} stage.

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 

[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620570#comment-14620570
 ] 

Daniel Darabos commented on SPARK-4879:
---

I have a fairly reliable reproduction with Spark 1.4.0 and HDFS. I'm running on 
10 EC2 m3.2xlarge instances using the ephemeral HDFS. If {spark.speculation} is 
true, I get hit by this 50% of the time or more. It's a fairly complex 
workload, not something you can test in a {spark-shell}. What I saw was that I 
saved a 400-partition RDD with {saveAsNewAPIHadoopFile} (which returned without 
error) and when I tried to read it back, the files for partitions 323 and 324 
were missing. (In the case that I took a closer look at.) I don't have the logs 
at hand now, but it's like you describe I think ({Failed to save output of 
task}). I can add them later if it would be useful.

I turned off {spark.speculation} and haven't seen the issue since.

Is there anything I could do to help debug this issue?
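
For completeness, the workaround we are running with is simply leaving speculation off; a minimal sketch, assuming the context is built in application code rather than via spark-submit flags (the application name is made up):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// spark.speculation defaults to false; setting it explicitly documents the
// choice until this issue is resolved.
val conf = new SparkConf()
  .setAppName("batch-save-job")        // hypothetical application name
  .set("spark.speculation", "false")
val sc = new SparkContext(conf)
{code}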

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 

[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620570#comment-14620570
 ] 

Daniel Darabos edited comment on SPARK-4879 at 7/9/15 6:15 PM:
---

I have a fairly reliable reproduction with Spark 1.4.0 and HDFS. I'm running on 
10 EC2 m3.2xlarge instances using the ephemeral HDFS. If {{spark.speculation}} 
is true, I get hit by this 50% of the time or more. It's a fairly complex 
workload, not something you can test in a {spark-shell}. What I saw was that I 
saved a 400-partition RDD with {{saveAsNewAPIHadoopFile}} (which returned 
without error) and when I tried to read it back, the files for partitions 323 
and 324 were missing. (In the case that I took a closer look at.) I don't have 
the logs at hand now, but it's like you describe I think ({{Failed to save 
output of task}}). I can add them later if it would be useful.

I turned off {{spark.speculation}} and haven't seen the issue since.

Is there anything I could do to help debug this issue?


was (Author: darabos):
I have a fairly reliable reproduction with Spark 1.4.0 and HDFS. I'm running on 
10 EC2 m3.2xlarge instances using the ephemeral HDFS. If {spark.speculation} is 
true, I get hit by this 50% of the time or more. It's a fairly complex 
workload, not something you can test in a {spark-shell}. What I saw was that I 
saved a 400-partition RDD with {saveAsNewAPIHadoopFile} (which returned without 
error) and when I tried to read it back, the files for partitions 323 and 324 
were missing. (In the case that I took a closer look at.) I don't have the logs 
at hand now, but it's like you describe I think ({Failed to save output of 
task}). I can add them later if it would be useful.

I turned off {spark.speculation} and haven't seen the issue since.

Is there anything I could do to help debug this issue?

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)

[jira] [Comment Edited] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-07-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14620570#comment-14620570
 ] 

Daniel Darabos edited comment on SPARK-4879 at 7/9/15 6:15 PM:
---

I have a fairly reliable reproduction with Spark 1.4.0 and HDFS. I'm running on 
10 EC2 m3.2xlarge instances using the ephemeral HDFS. If {{spark.speculation}} 
is true, I get hit by this 50% of the time or more. It's a fairly complex 
workload, not something you can test in a {{spark-shell}}. What I saw was that 
I saved a 400-partition RDD with {{saveAsNewAPIHadoopFile}} (which returned 
without error) and when I tried to read it back, the files for partitions 323 
and 324 were missing. (In the case that I took a closer look at.) I don't have 
the logs at hand now, but it's like you describe I think ({{Failed to save 
output of task}}). I can add them later if it would be useful.

I turned off {{spark.speculation}} and haven't seen the issue since.

Is there anything I could do to help debug this issue?


was (Author: darabos):
I have a fairly reliable reproduction with Spark 1.4.0 and HDFS. I'm running on 
10 EC2 m3.2xlarge instances using the ephemeral HDFS. If {{spark.speculation}} 
is true, I get hit by this 50% of the time or more. It's a fairly complex 
workload, not something you can test in a {spark-shell}. What I saw was that I 
saved a 400-partition RDD with {{saveAsNewAPIHadoopFile}} (which returned 
without error) and when I tried to read it back, the files for partitions 323 
and 324 were missing. (In the case that I took a closer look at.) I don't have 
the logs at hand now, but it's like you describe I think ({{Failed to save 
output of task}}). I can add them later if it would be useful.

I turned off {{spark.speculation}} and haven't seen the issue since.

Is there anything I could do to help debug this issue?

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
  Labels: backport-needed
 Fix For: 1.3.0

 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
 if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
  Thread.sleep(20 * 1000)
 }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 <console>:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 <console>:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 

[jira] [Created] (SPARK-8960) Style cleanup of spark_ec2.py

2015-07-09 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-8960:
-

 Summary: Style cleanup of spark_ec2.py
 Key: SPARK-8960
 URL: https://issues.apache.org/jira/browse/SPARK-8960
 Project: Spark
  Issue Type: Task
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial


The spark_ec2.py script could use some cleanup I think. There are simple style 
issues like mixing single and double quotes, but also some rather un-Pythonic 
constructs (e.g. 
https://github.com/apache/spark/pull/6336#commitcomment-12088624 that sparked 
this JIRA). Whenever I read it, I always find something that is too minor for a 
pull request/JIRA, but I'd fix it if it was my code. Perhaps we can address 
such issues in this JIRA.

The intention is not to introduce any behavioral changes. It's hard to verify 
this without testing, so perhaps we should also add some kind of test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8960) Style cleanup of spark_ec2.py

2015-07-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14621397#comment-14621397
 ] 

Daniel Darabos commented on SPARK-8960:
---

 So, can I sort of hijack this to continue the thread about moving this and 
 the mesos-based Spark EC2 scripts into a separate repo? then you and whoever 
 wishes to maintain it could have it much more liberally.

I just mentioned this idea to meawoppl. (Sorry, I didn't remember it was 
yours.) I think it makes sense! Is there a JIRA for that? Or do you want to 
hijack this? :)

My only worry is that a separate repo would possibly allow the script to fall 
into disrepair. I guess the question is whether it is critical for Spark to 
have solid EC2 support. The answer may be no.

I wouldn't volunteer for maintaining the new repo though. I don't know what 
most of the script does. I just wanted to fix a few things here and there. 
Hopefully meawoppl feels otherwise!

 Style cleanup of spark_ec2.py
 -

 Key: SPARK-8960
 URL: https://issues.apache.org/jira/browse/SPARK-8960
 Project: Spark
  Issue Type: Task
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial

 The spark_ec2.py script could use some cleanup I think. There are simple 
 style issues like mixing single and double quotes, but also some rather 
 un-Pythonic constructs (e.g. 
 https://github.com/apache/spark/pull/6336#commitcomment-12088624 that sparked 
 this JIRA). Whenever I read it, I always find something that is too minor for 
 a pull request/JIRA, but I'd fix it if it was my code. Perhaps we can address 
 such issues in this JIRA.
 The intention is not to introduce any behavioral changes. It's hard to verify 
 this without testing, so perhaps we should also add some kind of test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8893) Require positive partition counts in RDD.repartition

2015-07-08 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-8893:
--
Description: 
What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would 
expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns 
{{Array()}}. I think instead it should throw an {{IllegalArgumentException}}.

I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 0 
is also error prone. In fact that's how I found this strange behavior. I used 
{{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded 
down to zero and the results surprised me. I'd prefer an exception instead of 
unexpected (corrupt) results.

I'm happy to send a pull request for this.

  was:
What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would 
expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns 
{{Array()}}. I think instead it should throw an {{IllegalArgumentException}}.

I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 
is also error prone. In fact that's how I found this strange behavior. I used 
{{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded 
down to zero and the results surprised me. I'd prefer an exception instead of 
unexpected (corrupt) results.

I'm happy to send a pull request for this.


 Require positive partition counts in RDD.repartition
 

 Key: SPARK-8893
 URL: https://issues.apache.org/jira/browse/SPARK-8893
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial

 What does {{sc.parallelize(1 to 3).repartition(p).collect}} return? I would 
 expect {{Array(1, 2, 3)}} regardless of {{p}}. But if {{p}} < 1, it returns 
 {{Array()}}. I think instead it should throw an {{IllegalArgumentException}}.
 I think the case is pretty clear for {{p}} < 0. But the behavior for {{p}} = 
 0 is also error prone. In fact that's how I found this strange behavior. I 
 used {{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was 
 rounded down to zero and the results surprised me. I'd prefer an exception 
 instead of unexpected (corrupt) results.
 I'm happy to send a pull request for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8893) Require positive partition counts in RDD.repartition

2015-07-08 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-8893:
-

 Summary: Require positive partition counts in RDD.repartition
 Key: SPARK-8893
 URL: https://issues.apache.org/jira/browse/SPARK-8893
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial


What does {{sc.parallelize(1 to 3).repartition(x).collect}} return? I would 
expect {{Array(1, 2, 3)}} regardless of {{x}}. But if {{x}} < 1, it returns 
{{Array()}}. I think instead it should throw an {{IllegalArgumentException}}.

I think the case is pretty clear for {{x}} < 0. But the behavior for {{x}} = 0 
is also error prone. In fact that's how I found this strange behavior. I used 
{{rdd.repartition(a/b)}} with positive {{a}} and {{b}}, but {{a/b}} was rounded 
down to zero and the results surprised me. I'd prefer an exception instead of 
unexpected (corrupt) results.

I'm happy to send a pull request for this.
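
To make the proposal concrete, a minimal sketch of the intended behavior as a wrapper (a hypothetical helper for illustration; the real change would go inside RDD.repartition / coalesce):

{code}
import org.apache.spark.rdd.RDD

// Reject non-positive partition counts up front instead of silently
// returning an empty RDD.
def repartitionChecked[T](rdd: RDD[T], numPartitions: Int): RDD[T] = {
  require(numPartitions > 0,
    s"Number of partitions ($numPartitions) must be positive.")
  rdd.repartition(numPartitions)
}
{code}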



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8898) Jets3t hangs with more than 1 core

2015-07-08 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-8898:
-

 Summary: Jets3t hangs with more than 1 core
 Key: SPARK-8898
 URL: https://issues.apache.org/jira/browse/SPARK-8898
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.4.0
 Environment: S3
Reporter: Daniel Darabos


If I have an RDD that reads from S3 ({{newAPIHadoopFile}}), and try to write 
this to S3 ({{saveAsNewAPIHadoopFile}}), it hangs if I have more than 1 core 
per executor.

It sounds like a race condition, but so far I have seen it trigger 100% of the 
time. From a race for a limited number of connections I would expect it to 
succeed on at least one task, at least some of the time. But I never saw a 
single completed task, except when running with 1-core executors.

All executor threads hang with one of the following two stack traces:

{noformat:title=Stack trace 1}
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x0007759cae70 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked 0x0007759cae70 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416)
at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at 
org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:342)
at 
org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestHead(RestS3Service.java:718)
at 
org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1599)
at 
org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectDetailsImpl(RestS3Service.java:1535)
at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1987)
at org.jets3t.service.S3Service.getObjectDetails(S3Service.java:1332)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:107)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
at org.apache.hadoop.fs.s3native.$Proxy8.retrieveMetadata(Unknown 
Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:414)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1332)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:341)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:851)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:832)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:731)
at 
org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1030)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

{noformat:title=Stack trace 2}
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on 0x0007759cae70 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:518)
- locked 0x0007759cae70 (a 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$ConnectionPool)
at 

[jira] [Created] (SPARK-8902) Hostname missing in spark-ec2 error message

2015-07-08 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-8902:
-

 Summary: Hostname missing in spark-ec2 error message
 Key: SPARK-8902
 URL: https://issues.apache.org/jira/browse/SPARK-8902
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial


When SSH to a machine fails, the following message is printed:

{noformat}
Failed to SSH to remote host {0}.
Please check that you have provided the correct --identity-file and --key-pair 
parameters and try again.
{noformat}

The intention is to print the host name, but instead {\{0\}} is printed.

I have a pull request for this: https://github.com/apache/spark/pull/7288



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8902) Hostname missing in spark-ec2 error message

2015-07-08 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-8902:
--
Description: 
When SSH to a machine fails, the following message is printed:

{noformat}
Failed to SSH to remote host {0}.
Please check that you have provided the correct --identity-file and --key-pair 
parameters and try again.
{noformat}

The intention is to print the host name, but instead {0} is printed.

I have a pull request for this: https://github.com/apache/spark/pull/7288

  was:
When SSH to a machine fails, the following message is printed:

{noformat}
Failed to SSH to remote host {0}.
Please check that you have provided the correct --identity-file and --key-pair 
parameters and try again.
{noformat}

The intention is to print the host name, but instead {\{0\}} is printed.

I have a pull request for this: https://github.com/apache/spark/pull/7288


 Hostname missing in spark-ec2 error message
 ---

 Key: SPARK-8902
 URL: https://issues.apache.org/jira/browse/SPARK-8902
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial

 When SSH to a machine fails, the following message is printed:
 {noformat}
 Failed to SSH to remote host {0}.
 Please check that you have provided the correct --identity-file and 
 --key-pair parameters and try again.
 {noformat}
 The intention is to print the host name, but instead {0} is printed.
 I have a pull request for this: https://github.com/apache/spark/pull/7288



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8836) Sorted join

2015-07-06 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-8836:
-

 Summary: Sorted join
 Key: SPARK-8836
 URL: https://issues.apache.org/jira/browse/SPARK-8836
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Minor


In my [Spark Summit 2015 
presentation|https://spark-summit.org/2015/events/interactive-graph-analytics-with-spark/]
 I touted sorted joins. It would be a shame to talk about how great they are 
and then not try to introduce them into Spark.

When joining co-partitioned RDDs, the current Spark implementation builds a map 
of the contents of one partition and looks up the items from the other 
partition. 
(https://github.com/apache/spark/blob/v1.4.0/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
 using AppendOnlyMap.)

Another option for lining up the keys from the two partitions is to sort them 
both and then merge. Just doing this may already be a performance improvement.

But what we do is sort the partitions up-front, and then enjoy the benefits 
across many operations. Our joins are 10x faster than normal Spark joins and 
don't trigger GC. The hash-based join builds a large hashmap (the size of the 
partition) while the sorted join does not allocate any memory. The sorted 
partitions also benefit other operations, such as distinct, where we also avoid 
building a hashmap. (I think the logic is similar to sort-based shuffle, just 
at a later stage of the process.)

Our implementation is based on zipPartitions, and this is entirely workable. We 
have a custom RDD subclass (SortedRDD) and it overrides a bunch of methods. We 
have an implicit class that adds a toSortedRDD method on pair-RDDs.

But I think integrating this into Spark could take it a step further. What we 
have not investigated is cases where the sorting could be skipped. For example 
when an RDD came out of a sort-based shuffle, its partitions will be sorted, 
right? So even if the user never asks for the partitions to be sorted, they can 
become so, and the faster sorted implementations of join, distinct, etc could 
kick in automatically. This would speed up applications without any change in 
their code.

Instead of a subclass it would probably be best to do this with a simple 
hasSortedPartitions variable in the RDD. Then perhaps operations could have a 
preservesPartitionOrder parameter, like it is done with partitioner and 
preservesPartitioning now. (For example filter(), mapValues(), join(), and 
distinct() all keep each partition sorted.)

What do you think about all this?
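
For illustration, here is a minimal sketch of the zipPartitions-based merge join 
described above. It is not our SortedRDD code, just a toy: it assumes both inputs 
share the same partitioner, that each partition has already been sorted by key, 
and that keys are unique within a partition, so a real version would still have 
to handle duplicate keys.

{code}
import scala.reflect.ClassTag

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortedJoinSketch {
  // Sort each partition by key once; co-partitioned inputs can then be merge-joined.
  def sortPartitions[K: Ordering : ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, V)] =
    rdd.mapPartitions(it => it.toArray.sortBy(_._1).iterator, preservesPartitioning = true)

  // Merge-join two co-partitioned RDDs whose partitions are sorted by key.
  // No hashmap is built; the two iterators are advanced in lockstep.
  def sortedJoin[K: Ordering : ClassTag, A: ClassTag, B: ClassTag](
      left: RDD[(K, A)], right: RDD[(K, B)]): RDD[(K, (A, B))] =
    left.zipPartitions(right, preservesPartitioning = true) { (l, r) =>
      val ord = implicitly[Ordering[K]]
      val lb = l.buffered
      val rb = r.buffered
      new Iterator[(K, (A, B))] {
        // Skip ahead until both iterators sit on the same key (or one runs out).
        private def align(): Boolean = {
          while (lb.hasNext && rb.hasNext && !ord.equiv(lb.head._1, rb.head._1)) {
            if (ord.lt(lb.head._1, rb.head._1)) lb.next() else rb.next()
          }
          lb.hasNext && rb.hasNext
        }
        def hasNext: Boolean = align()
        def next(): (K, (A, B)) = {
          if (!align()) throw new NoSuchElementException("next on exhausted join")
          val (k, a) = lb.next()
          val (_, b) = rb.next()
          (k, (a, b))
        }
      }
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sorted-join-sketch").setMaster("local[2]"))
    val p = new HashPartitioner(4)
    val a = sortPartitions(sc.parallelize(1 to 1000).map(i => (i, i * 2)).partitionBy(p))
    val b = sortPartitions(sc.parallelize(1 to 1000 by 3).map(i => (i, i.toString)).partitionBy(p))
    println(sortedJoin(a, b).count())  // 334 matching keys
    sc.stop()
  }
}
{code}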



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-07-02 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611960#comment-14611960
 ] 

Daniel Darabos commented on SPARK-5945:
---

At the moment we have a ton of these infinite retries. A stage is retried a few 
dozen times, then its parent goes missing and Spark starts retrying the parent 
until it also goes missing... We are still debugging the cause of our fetch 
failures, but I just wanted to mention that if there were a 
{{spark.stage.maxFailures}} option, we would be setting it to 1 at this point.

Thanks for all the work on this bug. Even if it's not fixed yet, it's very 
informative.

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like its not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries, but due to failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still have the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map{ ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x)}.groupByKey().count()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4357) Modify release publishing to work with Scala 2.11

2015-06-30 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14608150#comment-14608150
 ] 

Daniel Darabos commented on SPARK-4357:
---

Hi, what is the obstacle to providing pre-built downloads for Scala 2.11 now? 
Is it SPARK-6154 (JDBC server not working)? Or is it just a matter of building 
the archives and adding them to the download page?

Is there something we could help with? The lack of thread-safe reflection in 
Scala 2.10 is a constant source of annoyance to us, but we are hesitant to 
trust our own Spark builds all the way. Thanks!

 Modify release publishing to work with Scala 2.11
 -

 Key: SPARK-4357
 URL: https://issues.apache.org/jira/browse/SPARK-4357
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 We'll need to do some effort to make our publishing work with 2.11 since the 
 current pipeline assumes a single set of artifacts is published.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8310) Spark EC2 branch in 1.4 is wrong

2015-06-29 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605411#comment-14605411
 ] 

Daniel Darabos commented on SPARK-8310:
---

It's an easy mistake to make, and one of the few things that are not covered by 
the release candidate process. We tested the release candidate on EC2, but we 
had to specifically override the version, since at that point there was no 
released 1.4.0. I have no idea how this could be avoided for future releases.

 Spark EC2 branch in 1.4 is wrong
 

 Key: SPARK-8310
 URL: https://issues.apache.org/jira/browse/SPARK-8310
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical
 Fix For: 1.4.1, 1.5.0


 It points to `branch-1.3` of spark-ec2 right now while it should point to 
 `branch-1.4`
 cc [~brdwrd] [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6944) Mechanism to associate generic operator scope with RDD's

2015-05-28 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562823#comment-14562823
 ] 

Daniel Darabos commented on SPARK-6944:
---

Thanks for this fantastic feature! Please tell me this is going to be made part 
of the public API. We have another layer (or two) of lazy operations built on 
top of Spark. It would be fantastic for debugging if we could tag the RDD 
operations with the higher-level unit names.

If there is extra work involved with making this public I'd be very happy to 
help!

 Mechanism to associate generic operator scope with RDD's
 

 Key: SPARK-6944
 URL: https://issues.apache.org/jira/browse/SPARK-6944
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, SQL
Reporter: Patrick Wendell
Assignee: Andrew Or
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6497) Class is not registered: scala.reflect.ManifestFactory$$anon$9

2015-03-24 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-6497:
-

 Summary: Class is not registered: 
scala.reflect.ManifestFactory$$anon$9
 Key: SPARK-6497
 URL: https://issues.apache.org/jira/browse/SPARK-6497
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Daniel Darabos
Priority: Minor


This is a slight regression from Spark 1.2.1 to 1.3.0.

{noformat}
spark-1.2.1-bin-hadoop2.4/bin/spark-shell --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.kryo.registrationRequired=true --conf 
'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)))
{noformat}

{noformat}
spark-1.3.0-bin-hadoop2.4/bin/spark-shell --conf 
spark.serializer=org.apache.spark.serializer.KryoSerializer --conf 
spark.kryo.registrationRequired=true --conf 
'spark.kryo.classesToRegister=scala.collection.mutable.WrappedArray$ofRef,[Lscala.Tuple2;'
scala> sc.parallelize(Seq(1 -> 1)).groupByKey.collect
Lost task 1.0 in stage 3.0 (TID 25, localhost): 
com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
Class is not registered: scala.reflect.ManifestFactory$$anon$9
Note: To register this class use: 
kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
Serialization trace:
evidence$1 (org.apache.spark.util.collection.CompactBuffer)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
at com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:318)
at 
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.write(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:161)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Class is not registered: 
scala.reflect.ManifestFactory$$anon$9
Note: To register this class use: 
kryo.register(scala.reflect.ManifestFactory$$anon$9.class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:561)
... 13 more
{noformat}

In our production code the exception is actually about 
{{scala.reflect.ManifestFactory$$anon$8}} instead of 
{{scala.reflect.ManifestFactory$$anon$9}} but it's probably the same thing. Any 
idea what changed from 1.2.1 to 1.3.0 that could be causing this?

We also get exceptions in 1.3.0 for {{scala.reflect.ClassTag$$anon$1}} and 
{{java.lang.Class}}, but I haven't reduced them to a spark-shell reproduction 
yet. We can of course just register these classes ourselves.
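
In case it helps, a sketch of one possible workaround: registering the offending 
classes by name through a KryoRegistrator, since the anonymous class names are 
awkward to pass on the command line. The class list below is only what the 
exceptions complained about; adjust it to whatever your own job hits.

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ManifestRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Anonymous classes can only be looked up by name.
    Seq(
      "scala.reflect.ManifestFactory$$anon$8",
      "scala.reflect.ManifestFactory$$anon$9",
      "scala.reflect.ClassTag$$anon$1",
      "java.lang.Class"
    ).foreach(name => kryo.register(Class.forName(name)))
  }
}

object ManifestRegistratorExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-registration-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .set("spark.kryo.registrator", classOf[ManifestRegistrator].getName)
    // create the SparkContext from `conf` as usual
  }
}
{code}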



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5102) CompressedMapStatus needs to be registered with Kryo

2015-01-06 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-5102:
-

 Summary: CompressedMapStatus needs to be registered with Kryo
 Key: SPARK-5102
 URL: https://issues.apache.org/jira/browse/SPARK-5102
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Daniel Darabos
Priority: Minor


After upgrading from Spark 1.1.0 to 1.2.0 I got this exception:

{code}
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
0.0 (TID 0, localhost): java.lang.IllegalArgumentException: Class is not 
registered: org.apache.spark.scheduler.CompressedMapStatus
Note: To register this class use: 
kryo.register(org.apache.spark.scheduler.CompressedMapStatus.class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:165)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:206)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

I had to register {{org.apache.spark.scheduler.CompressedMapStatus}} with Kryo. 
I think this should be done in {{spark/serializer/KryoSerializer.scala}}, 
unless instances of this class are not expected to be sent over the wire. 
(Maybe I'm doing something wrong?)
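
In the meantime, a minimal workaround sketch: the class does not seem to be 
public, so it has to be named as a string, and if I remember right 
{{spark.kryo.classesToRegister}} (added in 1.2.0) lets you do that without 
writing a registrator. The conf keys are standard; the rest is just illustration.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object MapStatusWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-mapstatus-workaround")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      // Register the scheduler class by name; classOf[...] is not accessible
      // from outside the org.apache.spark package.
      .set("spark.kryo.classesToRegister",
        "org.apache.spark.scheduler.CompressedMapStatus")
    // create the SparkContext from `conf` as usual
  }
}
{code}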



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-4831) Current directory always on classpath with spark-submit

2014-12-11 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-4831:
-

 Summary: Current directory always on classpath with spark-submit
 Key: SPARK-4831
 URL: https://issues.apache.org/jira/browse/SPARK-4831
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.1, 1.2.0
Reporter: Daniel Darabos
Priority: Minor


We had a situation where we were launching an application with spark-submit, 
and a file (play.plugins) was on the classpath twice, causing problems (trying 
to register plugins twice). Upon investigating how it got on the classpath 
twice, we found that it was present in one of our jars, and also in the current 
working directory. But the one in the current working directory should not be 
on the classpath. We never asked spark-submit to put the current directory on 
the classpath.

I think this is caused by a line in 
[compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]:

{code}
CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH
{code}

Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, 
which means the current working directory.

We tried setting SPARK_CLASSPATH to a bogus value, but that is [not 
allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312].

What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can 
send a pull request for that I think. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4831) Current directory always on classpath with spark-submit

2014-12-11 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243074#comment-14243074
 ] 

Daniel Darabos commented on SPARK-4831:
---

bq. Is it perhaps finding an exploded directory of classes?

Yes, that is exactly the situation. One instance of the file is in a jar, 
another is just there (free-floating) in the directory. It is a configuration 
file. (Actually it's in a conf directory, but Play looks for both 
play.plugins and conf/play.plugins with getResources in the classpath. So 
it finds the copy inside the generated jar, and also the one in the conf directory 
of the project. We can of course work around this in numerous ways.)

I think there is no reason for spark-submit to add an empty entry to the 
classpath. It will just lead to accidents like ours. If the user wants to add 
an empty entry, they can easily do so.

I've sent https://github.com/apache/spark/pull/3678 as a possible fix. Thanks 
for investigating!

 Current directory always on classpath with spark-submit
 ---

 Key: SPARK-4831
 URL: https://issues.apache.org/jira/browse/SPARK-4831
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.1.1, 1.2.0
Reporter: Daniel Darabos
Priority: Minor

 We had a situation where we were launching an application with spark-submit, 
 and a file (play.plugins) was on the classpath twice, causing problems 
 (trying to register plugins twice). Upon investigating how it got on the 
 classpath twice, we found that it was present in one of our jars, and also in 
 the current working directory. But the one in the current working directory 
 should not be on the classpath. We never asked spark-submit to put the 
 current directory on the classpath.
 I think this is caused by a line in 
 [compute-classpath.sh|https://github.com/apache/spark/blob/v1.2.0-rc2/bin/compute-classpath.sh#L28]:
 {code}
 CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH
 {code}
 Now if SPARK_CLASSPATH is empty, the empty string is added to the classpath, 
 which means the current working directory.
 We tried setting SPARK_CLASSPATH to a bogus value, but that is [not 
 allowed|https://github.com/apache/spark/blob/v1.2.0-rc2/core/src/main/scala/org/apache/spark/SparkConf.scala#L312].
 What is the right solution? Only add SPARK_CLASSPATH if it's non-empty? I can 
 send a pull request for that I think. Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3721) Broadcast Variables above 2GB break in PySpark

2014-10-27 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14185189#comment-14185189
 ] 

Daniel Darabos commented on SPARK-3721:
---

We are hitting all kinds of MaxInt and array size limits when broadcasting a 
12GB beast from Scala too.


 Broadcast Variables above 2GB break in PySpark
 --

 Key: SPARK-3721
 URL: https://issues.apache.org/jira/browse/SPARK-3721
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Brad Miller
Assignee: Davies Liu

 The bug displays 3 unique failure modes in PySpark, all of which seem to be 
 related to broadcast variable size. Note that the tests below ran python 
 2.7.3 on all machines and used the Spark 1.1.0 binaries.
 **BLOCK 1** [no problem]
 {noformat}
 import cPickle
 from pyspark import SparkContext
 def check_pre_serialized(size):
 msg = cPickle.dumps(range(2 ** size))
 print 'serialized length:', len(msg)
 bvar = sc.broadcast(msg)
 print 'length recovered from broadcast variable:', len(bvar.value)
 print 'correct value recovered:', msg == bvar.value
 bvar.unpersist()
 def check_unserialized(size):
 msg = range(2 ** size)
 bvar = sc.broadcast(msg)
 print 'correct value recovered:', msg == bvar.value
 bvar.unpersist()
 SparkContext.setSystemProperty('spark.executor.memory', '15g')
 SparkContext.setSystemProperty('spark.cores.max', '5')
 sc = SparkContext('spark://crosby.research.intel-research.net:7077', 
 'broadcast_bug')
 {noformat}
 **BLOCK 2**  [no problem]
 {noformat}
 check_pre_serialized(20)
  serialized length: 9374656
  length recovered from broadcast variable: 9374656
  correct value recovered: True
 {noformat}
 **BLOCK 3**  [no problem]
 {noformat}
 check_unserialized(20)
  correct value recovered: True
 {noformat}
 **BLOCK 4**  [no problem]
 {noformat}
 check_pre_serialized(27)
  serialized length: 1499501632
  length recovered from broadcast variable: 1499501632
  correct value recovered: True
 {noformat}
 **BLOCK 5**  [no problem]
 {noformat}
 check_unserialized(27)
  correct value recovered: True
 {noformat}
 **BLOCK 6**  **[ERROR 1: unhandled error from cPickle.dumps inside 
 sc.broadcast]**
 {noformat}
 check_pre_serialized(28)
 .
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  354
  355 def dumps(self, obj):
  --> 356 return cPickle.dumps(obj, 2)
  357
  358 loads = cPickle.loads
 
  SystemError: error return without exception set
 {noformat}
 **BLOCK 7**  [no problem]
 {noformat}
 check_unserialized(28)
  correct value recovered: True
 {noformat}
 **BLOCK 8**  **[ERROR 2: no error occurs and *incorrect result* is returned]**
 {noformat}
 check_pre_serialized(29)
  serialized length: 6331339840
  length recovered from broadcast variable: 2036372544
  correct value recovered: False
 {noformat}
 **BLOCK 9**  **[ERROR 3: unhandled error from zlib.compress inside 
 sc.broadcast]**
 {noformat}
 check_unserialized(29)
 ..
  /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
  418 
  419 def dumps(self, obj):
  --> 420 return zlib.compress(self.serializer.dumps(obj), 1)
  421 
  422 def loads(self, obj):
  
  OverflowError: size does not fit in an int
 {noformat}
 **BLOCK 10**  [ERROR 1]
 {noformat}
 check_pre_serialized(30)
 ...same as above...
 {noformat}
 **BLOCK 11**  [ERROR 3]
 {noformat}
 check_unserialized(30)
 ...same as above...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2014-10-09 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14165364#comment-14165364
 ] 

Daniel Darabos commented on SPARK-3644:
---

Hi Dev, thanks for the offer! Have you seen Kousuke's PR? 
https://github.com/apache/spark/pull/2333 seems to cover a lot of ground. Maybe 
he or the reviewers there can tell you how to make yourself useful!

Unrelatedly, I wanted to mention that you can disregard my earlier comments. We 
cannot use XHR on these endpoints, since a different port means a different 
security domain. And anyway it turned out to be really easy to use a custom 
SparkListener for what we wanted to do. Sorry for the noise.
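
In case it helps the next person, a stripped-down sketch of what such a listener 
can look like, just a busy/idle counter. Listener events are delivered 
asynchronously, so treat it as an eventually consistent signal rather than an 
exact one.

{code}
import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerStageSubmitted}

// Counts stages that have been submitted but not yet completed, so an
// application can show a simple busy/idle indicator without scraping port 4040.
class ActiveStageListener extends SparkListener {
  private val active = new AtomicInteger(0)

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
    active.incrementAndGet()

  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    active.decrementAndGet()

  def isIdle: Boolean = active.get() == 0
}

object ActiveStageListenerExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "listener-sketch")
    val listener = new ActiveStageListener
    sc.addSparkListener(listener)
    sc.parallelize(1 to 1000000).map(_ * 2).count()
    Thread.sleep(1000)  // give the listener bus a moment to drain
    println(s"idle: ${listener.isIdle}")
    sc.stop()
  }
}
{code}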

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)

2014-10-06 Thread Daniel Darabos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14160183#comment-14160183
 ] 

Daniel Darabos commented on SPARK-3644:
---

Notes from another consumer with a different use-case: We have an analytics 
program with a web UI and a backend that uses Spark. For us the following 
features of the REST API are important:

 - Most importantly, be able to tell if Spark is idle or not. In other words, 
get a list of active stages.
 - Track the progress of a stage. A calculation can take any number of stages, 
so we cannot make a very good progress indicator, but this would still tell the 
user if there is any sort of progress being made. I think your pull request 
covers this.
 - Kill a stage, or all active stages. I think this feature already exists.

With this we could add a small UI element that queries the Spark status by an 
XHR request and gives our users some visibility into the system. (Sure, they 
could just go to port 4040, but that's too complex for them.)

Thanks for working on this!
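
A note for anyone reading this later: the programmatic status API that eventually 
shipped as SparkStatusTracker in Spark 1.2 covers the first two bullets, and 
cancellation is already on SparkContext. A rough sketch, assuming a 1.2+ build:

{code}
import org.apache.spark.SparkContext

object StatusPollingSketch {
  // "Is Spark idle?": no active stages means nothing is running.
  def isIdle(sc: SparkContext): Boolean =
    sc.statusTracker.getActiveStageIds().isEmpty

  // Coarse progress per active stage: (stageId, completed tasks, total tasks).
  def stageProgress(sc: SparkContext): Seq[(Int, Int, Int)] =
    sc.statusTracker.getActiveStageIds().toSeq.flatMap { id =>
      sc.statusTracker.getStageInfo(id).map(s => (id, s.numCompletedTasks(), s.numTasks()))
    }

  // "Kill all active stages": cancel every running job.
  def killEverything(sc: SparkContext): Unit =
    sc.cancelAllJobs()
}
{code}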

 REST API for Spark application info (jobs / stages / tasks / storage info)
 --

 Key: SPARK-3644
 URL: https://issues.apache.org/jira/browse/SPARK-3644
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Reporter: Josh Rosen

 This JIRA is a forum to draft a design proposal for a REST interface for 
 accessing information about Spark applications, such as job / stage / task / 
 storage status.
 There have been a number of proposals to serve JSON representations of the 
 information displayed in Spark's web UI.  Given that we might redesign the 
 pages of the web UI (and possibly re-implement the UI as a client of a REST 
 API), the API endpoints and their responses should be independent of what we 
 choose to display on particular web UI pages / layouts.
 Let's start a discussion of what a good REST API would look like from 
 first-principles.  We can discuss what urls / endpoints expose access to 
 data, how our JSON responses will be formatted, how fields will be named, how 
 the API will be documented and tested, etc.
 Some links for inspiration:
 https://developer.github.com/v3/
 http://developer.netflix.com/docs/REST_API_Reference
 https://helloreverb.com/developers/swagger



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


