[jira] [Updated] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-19 Thread Robin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin updated SPARK-37690:
--
Description: 
In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
{code}

The following error is now produced:   
{code:python} 
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`) 
{code} 

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code, and the `createOrReplaceTempView` method is named explicitly 
to indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
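A possible workaround (a sketch of my own, assuming the cycle check only fires 
when a view's new definition references the view being replaced) is to give 
each pipeline step a unique view name; the step names below are hypothetical:

{code:python}
# Hedged workaround sketch: unique intermediate view names (df_step_0,
# df_step_1 are illustrative) avoid a view definition that references itself.
df = spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)")
df.createOrReplaceTempView("df_step_0")

df = spark.sql("SELECT * FROM df_step_0")
df.createOrReplaceTempView("df_step_1")

df = spark.sql("SELECT * FROM df_step_1")
{code}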
  

  was:
In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
{code}

The following error is now produced:   
{code:python} 
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`) 
{code} 

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code, and the `createOrReplaceTempView` method is named explicitly 
to indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
  


> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
>     Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}
> from pyspark.context import SparkContext
> from pyspark.sql import SparkSession
>
> sc = SparkContext.getOrCreate()
> spark = SparkSession(sc)
>
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
> df = spark.sql(sql)
> df.createOrReplaceTempView("df")
>
> sql = """ SELECT * FROM df """
> df = spark.sql(sql)
> df.createOrReplaceTempView("df")
>
> sql = """ SELECT * FROM df """
> df = spark.sql(sql)
> {code}
> The following error is now produced:   
> {code:python} 
> AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly to indicate that replacing an existing view should be allowed.  An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   






[jira] [Created] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-19 Thread Robin (Jira)
Robin created SPARK-37690:
-

 Summary: Recursive view `df` detected (cycle: `df` -> `df`)
 Key: SPARK-37690
 URL: https://issues.apache.org/jira/browse/SPARK-37690
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
Reporter: Robin


In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
This change is backwards incompatible, and means a common way of running 
pipelines of SQL queries no longer works.   The following is a simple 
reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 

{code:python}
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
df.createOrReplaceTempView("df")

sql = """ SELECT * FROM df """
df = spark.sql(sql)
{code}

The following error is now produced:

{code:python}
AnalysisException: Recursive view `df` detected (cycle: `df` -> `df`)
{code}

I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a lot 
of legacy code, and the `createOrReplaceTempView` method is named explicitly 
to indicate that replacing an existing view should be allowed.  An internet search 
suggests other users have run into similar problems, e.g. 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
  






[jira] [Commented] (SPARK-37690) Recursive view `df` detected (cycle: `df` -> `df`)

2021-12-23 Thread Robin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-37690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17464354#comment-17464354
 ] 

Robin commented on SPARK-37690:
---

Someone 
[here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
 has suggested this is an intentional breaking change introduced in Spark 3.1:

From [Migration Guide: SQL, Datasets and DataFrame - Spark 3.1.1 Documentation 
(apache.org)|https://spark.apache.org/docs/3.1.1/sql-migration-guide.html]:

> In Spark 3.1, the temporary view will have same behaviors with the permanent 
> view, i.e. capture and store runtime SQL configs, SQL text, catalog and 
> namespace. The captured view properties will be applied during the parsing 
> and analysis phases of the view resolution. To restore the behavior before 
> Spark 3.1, {*}you can set spark.sql.legacy.storeAnalyzedPlanForView to 
> true{*}.

 

Grateful if someone could clarify.  Worth noting that the example code works in 
Spark 3.1.2, just not 3.2.0.  It's not obvious to me that the above quote implies 
`createOrReplaceTempView` would fail in the example code posted in the issue.
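
For reference, a minimal sketch of the mitigation suggested there (my reading 
of the migration guide, untested against 3.2.0):

{code:python}
# Assumption: restoring the pre-3.1 analyzed-plan behaviour for temp views
# also lifts the recursive-view check that breaks the example above.
spark.conf.set("spark.sql.legacy.storeAnalyzedPlanForView", "true")

df = spark.sql("SELECT id AS col_1, rand() AS col_2 FROM RANGE(10)")
df.createOrReplaceTempView("df")

df = spark.sql("SELECT * FROM df")
df.createOrReplaceTempView("df")  # should no longer raise AnalysisException
{code}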

> Recursive view `df` detected (cycle: `df` -> `df`)
> --
>
> Key: SPARK-37690
> URL: https://issues.apache.org/jira/browse/SPARK-37690
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>    Affects Versions: 3.2.0
>Reporter: Robin
>Priority: Major
>
> In Spark 3.2.0, you can no longer reuse the same name for a temporary view.  
> This change is backwards incompatible, and means a common way of running 
> pipelines of SQL queries no longer works.   The following is a simple 
> reproducible example that works in Spark 2.x and 3.1.2, but not in 3.2.0: 
> {code:python}
> from pyspark.context import SparkContext 
> from pyspark.sql import SparkSession 
> sc = SparkContext.getOrCreate() 
> spark = SparkSession(sc) 
> sql = """ SELECT id as col_1, rand() AS col_2 FROM RANGE(10); """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) 
> df.createOrReplaceTempView("df") 
> sql = """ SELECT * FROM df """ 
> df = spark.sql(sql) {code}   
> The following error is now produced:   
> {code:python}AnalysisException: Recursive view `df` detected (cycle: `df` -> 
> `df`) 
> {code} 
> I'm reasonably sure this change is unintentional in 3.2.0, since it breaks a 
> lot of legacy code, and the `createOrReplaceTempView` method is named 
> explicitly to indicate that replacing an existing view should be allowed.  An 
> internet search suggests other users have run into similar problems, e.g. 
> [here|https://community.databricks.com/s/question/0D53f1Qugr7CAB/upgrading-from-spark-24-to-32-recursive-view-errors-when-using]
>   






[jira] [Created] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-04 Thread Robin (Jira)
Robin created SPARK-42346:
-

 Summary: distinct(count colname) with UNION ALL causes query 
analyzer bug
 Key: SPARK-42346
 URL: https://issues.apache.org/jira/browse/SPARK-42346
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Robin


If you combine a UNION ALL with a count(distinct colname) you get a query 
analyzer bug.

 

This behaviour was introduced in 3.3.0; the bug was not present in 3.2.1.

 

Here is a reprex in PySpark:

{code:python}
import pandas as pd

df_pd = pd.DataFrame([
    {'surname': 'a', 'first_name': 'b'}
])
df_spark = spark.createDataFrame(df_pd)
df_spark.createOrReplaceTempView("input_table")

sql = """
SELECT
    (SELECT Count(DISTINCT first_name) FROM input_table)
        AS distinct_value_count
FROM   input_table
UNION ALL
SELECT
    (SELECT Count(DISTINCT surname) FROM input_table)
        AS distinct_value_count
FROM   input_table """

spark.sql(sql).toPandas()
{code}
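
A possible workaround (a sketch of my own, on the assumption that avoiding the 
scalar subqueries sidesteps the analyzer bug; it is equivalent here only 
because input_table has a single row):

{code:python}
# Aggregate directly in each UNION ALL branch instead of using
# scalar subqueries.
sql = """
SELECT COUNT(DISTINCT first_name) AS distinct_value_count FROM input_table
UNION ALL
SELECT COUNT(DISTINCT surname) AS distinct_value_count FROM input_table
"""
spark.sql(sql).toPandas()
{code}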

 






[jira] [Updated] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-04 Thread Robin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin updated SPARK-42346:
--
Priority: Major  (was: Minor)

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Robin
>Priority: Major
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  






[jira] [Updated] (SPARK-42346) distinct(count colname) with UNION ALL causes query analyzer bug

2023-02-04 Thread Robin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin updated SPARK-42346:
--
Priority: Minor  (was: Major)

> distinct(count colname) with UNION ALL causes query analyzer bug
> 
>
> Key: SPARK-42346
> URL: https://issues.apache.org/jira/browse/SPARK-42346
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Robin
>Priority: Minor
>
> If you combine a UNION ALL with a count(distinct colname) you get a query 
> analyzer bug.
>  
> This behaviour is introduced in 3.3.0.  The bug was not present in 3.2.1.
>  
> Here is a reprex in PySpark:
> {{df_pd = pd.DataFrame([}}
> {{    \{'surname': 'a', 'first_name': 'b'}}}
> {{])}}
> {{df_spark = spark.createDataFrame(df_pd)}}
> {{df_spark.createOrReplaceTempView("input_table")}}
> {{sql = """}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT first_name) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table}}
> {{UNION ALL}}
> {{SELECT }}
> {{    (SELECT Count(DISTINCT surname) FROM   input_table) }}
> {{        AS distinct_value_count}}
> {{FROM   input_table """}}
> {{spark.sql(sql).toPandas()}}
>  






[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2015-06-26 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603016#comment-14603016
 ] 

Robin East commented on SPARK-3650:
---

What is the status of this issue? A user on the mailing list just ran into 
this issue. It looks like PR-2495 should fix it. Is there a version that 
is being targeted for the fix?

 Triangle Count handles reverse edges incorrectly
 

 Key: SPARK-3650
 URL: https://issues.apache.org/jira/browse/SPARK-3650
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.1.0, 1.2.0
Reporter: Joseph E. Gonzalez
Priority: Critical

 The triangle count implementation assumes that edges are aligned in a 
 canonical direction.  As stated in the documentation:
 bq. Note that the input graph should have its edges in canonical direction 
 (i.e. the `sourceId` less than `destId`)
 However the TriangleCount algorithm does not verify that this condition holds 
 and indeed even the unit tests exploits this functionality:
 {code:scala}
 val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
 Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
   val rawEdges = sc.parallelize(triangles, 2)
   val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
   val triangleCount = graph.triangleCount()
   val verts = triangleCount.vertices
   verts.collect().foreach { case (vid, count) =>
 if (vid == 0) {
   assert(count === 4)  // <-- Should be 2
 } else {
   assert(count === 2) // <-- Should be 1
 }
   }
 {code}






[jira] [Commented] (SPARK-5692) Model import/export for Word2Vec

2015-07-30 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647432#comment-14647432
 ] 

Robin East commented on SPARK-5692:
---

Hi, the description includes the sentence 'We may want to discuss whether we 
want to be compatible with the original Word2Vec model storage format.' Was 
this ever discussed? I can't see anything in the comment stream for this JIRA. Is 
there any interest in adding functionality to import Word2Vec models from the 
original binary format (e.g. the 300-million-word Google News model)?

 Model import/export for Word2Vec
 

 Key: SPARK-5692
 URL: https://issues.apache.org/jira/browse/SPARK-5692
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
 Fix For: 1.4.0


 Support save and load for Word2VecModel. We may want to discuss whether we 
 want to be compatible with the original Word2Vec model storage format.






[jira] [Created] (SPARK-10228) Integer overflow in VertexRDDImpl.count

2015-08-25 Thread Robin Cheng (JIRA)
Robin Cheng created SPARK-10228:
---

 Summary: Integer overflow in VertexRDDImpl.count
 Key: SPARK-10228
 URL: https://issues.apache.org/jira/browse/SPARK-10228
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.4.1
Reporter: Robin Cheng


VertexRDDImpl overrides RDD.count() but aggregates Int instead of Long:

{code:scala}
/** The number of vertices in the RDD. */
override def count(): Long = {
  partitionsRDD.map(_.size).reduce(_ + _)
}
{code}

This causes Pregel to stop iterating when the number of messages is negative, 
giving incorrect results.
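
A sketch of the obvious fix (my assumption, not necessarily the committed 
patch) is to widen each partition size to Long before summing:

{code:scala}
/** The number of vertices in the RDD. */
override def count(): Long = {
  // Sum partition sizes as Long so totals beyond Int.MaxValue
  // cannot overflow into a negative count.
  partitionsRDD.map(_.size.toLong).reduce(_ + _)
}
{code}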






[jira] [Created] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)
Robin East created SPARK-10598:
--

 Summary: RoutingTablePartition toMessage method refers to bytes 
instead of bits
 Key: SPARK-10598
 URL: https://issues.apache.org/jira/browse/SPARK-10598
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.5.0, 1.4.1, 1.4.0
Reporter: Robin East
Priority: Minor
 Fix For: 1.5.1









[jira] [Commented] (SPARK-10598) RoutingTablePartition toMessage method refers to bytes instead of bits

2015-09-14 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744359#comment-14744359
 ] 

Robin East commented on SPARK-10598:


Apologies - have checked it out. You're referring to the Fix and Target Version 
fields, right?

> RoutingTablePartition toMessage method refers to bytes instead of bits
> --
>
> Key: SPARK-10598
> URL: https://issues.apache.org/jira/browse/SPARK-10598
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Robin East
>Assignee: Robin East
>Priority: Trivial
>







[jira] [Commented] (SPARK-6357) Add unapply in EdgeContext

2015-09-17 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804664#comment-14804664
 ] 

Robin East commented on SPARK-6357:
---

[~maropu] Looks like the PR was merged. Does that mean this JIRA can be closed?

> Add unapply in EdgeContext
> --
>
> Key: SPARK-6357
> URL: https://issues.apache.org/jira/browse/SPARK-6357
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Takeshi Yamamuro
>
> This extractor is mainly used for Graph#aggregateMessages*.






[jira] [Commented] (SPARK-9429) TriangleCount: job aborted due to stage failure

2015-09-17 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14804678#comment-14804678
 ] 

Robin East commented on SPARK-9429:
---

The scala docs for triangleCount state 'Note that the input graph should have 
its edges in canonical direction (i.e. the sourceId less than destId). Also the 
graph must have been partitioned using 
org.apache.spark.graphx.Graph#partitionBy.' The code checks for this condition 
and throws an assertion error when the condition is not met, e.g. you have an 
Edge(2L,1L,...) which is not in canonical direction.

> TriangleCount: job aborted due to stage failure
> ---
>
> Key: SPARK-9429
> URL: https://issues.apache.org/jira/browse/SPARK-9429
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: YangBaoxing
>
> Hi, all !
> When I run the TriangleCount algorithm on my own data, an exception like "Job 
> aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): 
> java.lang.AssertionError: assertion failed" occurred. Then I checked the 
> source code and found that the problem is in line "assert((dblCount & 1) == 
> 0)". And I also found that it run successfully on Array(0L -> 1L, 1L -> 2L, 
> 2L -> 0L) and Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 0L -> 2L, 2L -> 1L, 1L -> 
> 0L) while failed on Array(0L -> 1L, 1L -> 2L, 2L -> 0L, 2L -> 1L). It seems 
> to be more suitable for all unidirectional or bidirectional graph. Is 
> TriangleCount suitable for incomplete bidirectional graph? The complete 
> exception as follows:
> Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 4.0 (TID 8, localhost): 
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:90)
>   at 
> org.apache.spark.graphx.lib.TriangleCount$$anonfun$7.apply(TriangleCount.scala:87)
>   at 
> org.apache.spark.graphx.impl.VertexPartitionBaseOps.leftJoin(VertexPartitionBaseOps.scala:140)
>   at 
> org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:159)
>   at 
> org.apache.spark.graphx.impl.VertexRDDImpl$$anonfun$3.apply(VertexRDDImpl.scala:156)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.graphx.VertexRDD.compute(VertexRDD.scala:71)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)






[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst

2016-02-23 Thread Robin Aly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159109#comment-15159109
 ] 

Robin Aly commented on SPARK-13431:
---

I'm experiencing the same problem when checking out 1.6.0 (git clone 
g...@github.com:apache/spark.git  v1.6.0).

[INFO] Excluding org.apache.ivy:ivy:jar:2.4.0 from the shaded jar.
[INFO] Excluding oro:oro:jar:2.0.8 from the shaded jar.
[INFO] Excluding net.razorvine:pyrolite:jar:4.9 from the shaded jar.
[INFO] Excluding net.sf.py4j:py4j:jar:0.9.1 from the shaded jar.
[INFO] Excluding org.apache.spark:spark-unsafe_2.11:jar:2.0.0-SNAPSHOT from the 
shaded jar.
[INFO] Excluding org.codehaus.janino:janino:jar:2.7.8 from the shaded jar.
[INFO] Excluding org.codehaus.janino:commons-compiler:jar:2.7.8 from the shaded 
jar.
[INFO] Excluding org.antlr:antlr-runtime:jar:3.5.2 from the shaded jar.
[INFO] Excluding commons-codec:commons-codec:jar:1.10 from the shaded jar.
[INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded jar.
[INFO] Excluding org.scala-lang.modules:scala-xml_2.11:jar:1.0.2 from the 
shaded jar.
[INFO] 
[INFO] Reactor Summary:
[INFO] 
[INFO] Spark Project Parent POM ... SUCCESS [  7.063 s]
[INFO] Spark Project Sketch ... SUCCESS [  7.766 s]
[INFO] Spark Project Test Tags  SUCCESS [  3.325 s]
[INFO] Spark Project Core . SUCCESS [03:13 min]
[INFO] Spark Project GraphX ... SUCCESS [ 23.761 s]
[INFO] Spark Project ML Library ... SUCCESS [01:52 min]
[INFO] Spark Project Tools  SUCCESS [  4.876 s]
[INFO] Spark Project Networking ... SUCCESS [ 16.129 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 13.367 s]
[INFO] Spark Project Streaming  SUCCESS [ 53.482 s]
[INFO] Spark Project Catalyst . FAILURE [02:15 min]
[INFO] Spark Project SQL .. SKIPPED
[INFO] Spark Project Hive . SKIPPED
[INFO] Spark Project Docker Integration Tests . SKIPPED
[INFO] Spark Project Unsafe ... SKIPPED
[INFO] Spark Project Assembly . SKIPPED
[INFO] Spark Project External Twitter . SKIPPED
[INFO] Spark Project External Flume ... SKIPPED
[INFO] Spark Project External Flume Sink .. SKIPPED
[INFO] Spark Project External Flume Assembly .. SKIPPED
[INFO] Spark Project External Akka  SKIPPED
[INFO] Spark Project External MQTT  SKIPPED
[INFO] Spark Project External MQTT Assembly ... SKIPPED
[INFO] Spark Project External ZeroMQ .. SKIPPED
[INFO] Spark Project Examples . SKIPPED
[INFO] Spark Project REPL . SKIPPED
[INFO] Spark Project Launcher . SKIPPED
[INFO] Spark Project External Kafka ... SKIPPED
[INFO] Spark Project External Kafka Assembly .. SKIPPED
[INFO] Spark Project YARN . SKIPPED
[INFO] Spark Project YARN Shuffle Service . SKIPPED
[INFO] 
[INFO] BUILD FAILURE
[INFO] 
[INFO] Total time: 09:34 min
[INFO] Finished at: 2016-02-23T17:04:28+01:00
[INFO] Final Memory: 65M/1297M
[INFO] 
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-shade-plugin:2.4.3:shade (default) on project 
spark-catalyst_2.10: Error creating shaded jar: Method code too large! -> [Help 
1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

> Maven build fails due to: Method code too large! in Catalyst
> 
>
> Key: SPARK-13431
> URL: https://issues.apache.org/jira/browse/SPARK-13431
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> Cannot build the project when running the normal build commands

[jira] [Commented] (SPARK-13431) Maven build fails due to: Method code too large! in Catalyst

2016-02-23 Thread Robin Aly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159129#comment-15159129
 ] 

Robin Aly commented on SPARK-13431:
---

Sorry for the spam. I only thought I'd report it, as the bug is only filed as 
affecting v2.0.0 (and I am using v1.6.0).

> Maven build fails due to: Method code too large! in Catalyst
> 
>
> Key: SPARK-13431
> URL: https://issues.apache.org/jira/browse/SPARK-13431
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Stavros Kontopoulos
>Priority: Blocker
>
> Cannot build the project when running the normal build commands, e.g.:
> {code}
> build/mvn -Phadoop-2.6 -Dhadoop.version=2.6.0  clean package
> ./make-distribution.sh --name test --tgz -Phadoop-2.6 
> {code}
> Integration builds are also failing: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/229/console
> https://ci.typesafe.com/job/mit-docker-test-zk-ref/12/console
> It looks like this is the commit that introduced the issue:
> https://github.com/apache/spark/commit/7925071280bfa1570435bde3e93492eaf2167d56






[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly

2016-02-18 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15152327#comment-15152327
 ] 

Robin East commented on SPARK-3650:
---

I did ask if the PR could be revived but never followed up on it. If I get a 
moment I'll try and submit the PR myself; however, I have been a little busy on 
other GraphX things.

By the way, there is a workaround to the issue, which is to make sure your edges 
are in the canonical direction before calling triangleCount (see the sketch 
below).
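
A sketch of that workaround (as I would write it; assumes an existing `graph: 
Graph[_, _]` and treats the graph as undirected):

{code:scala}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Rewrite every edge so that srcId < dstId, drop duplicates, then
// repartition as triangleCount requires.
val canonicalEdges = graph.edges
  .map(e => if (e.srcId < e.dstId) (e.srcId, e.dstId) else (e.dstId, e.srcId))
  .distinct()
val counts = Graph.fromEdgeTuples(canonicalEdges, defaultValue = true)
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .triangleCount()
  .vertices
{code}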

> Triangle Count handles reverse edges incorrectly
> 
>
> Key: SPARK-3650
> URL: https://issues.apache.org/jira/browse/SPARK-3650
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.1.0, 1.2.0
>Reporter: Joseph E. Gonzalez
>Priority: Critical
>
> The triangle count implementation assumes that edges are aligned in a 
> canonical direction.  As stated in the documentation:
> bq. Note that the input graph should have its edges in canonical direction 
> (i.e. the `sourceId` less than `destId`)
> However the TriangleCount algorithm does not verify that this condition holds 
> and indeed even the unit tests exploits this functionality:
> {code:scala}
> val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++
> Array(0L -> -1L, -1L -> -2L, -2L -> 0L)
>   val rawEdges = sc.parallelize(triangles, 2)
>   val graph = Graph.fromEdgeTuples(rawEdges, true).cache()
>   val triangleCount = graph.triangleCount()
>   val verts = triangleCount.vertices
>   verts.collect().foreach { case (vid, count) =>
> if (vid == 0) {
>   assert(count === 4)  // <-- Should be 2
> } else {
>   assert(count === 2) // <-- Should be 1
> }
>   }
> {code}






[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156641#comment-15156641
 ] 

Robin East commented on SPARK-10945:


[~ankurd] Did you get a chance to look at this?

> GraphX computes Pagerank with NaN (with some datasets)
> --
>
> Key: SPARK-10945
> URL: https://issues.apache.org/jira/browse/SPARK-10945
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.0
> Environment: Linux
>Reporter: Khaled Ammar
>  Labels: test
>
> Hi,
> I run GraphX in a medium size standalone Spark 1.3.0 installation. The 
> pagerank typically works fine, except with one dataset (Twitter: 
> http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that 
> is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the 
> algorithm runs for 1 iteration only.  
> Thanks,
> -Khaled






[jira] [Commented] (SPARK-6808) Checkpointing after zipPartitions results in NODE_LOCAL execution

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15156599#comment-15156599
 ] 

Robin East commented on SPARK-6808:
---

It doesn't look like this should be tagged with GraphX, as the Reporter 
mentions he can reproduce it using just Spark core code.

> Checkpointing after zipPartitions results in NODE_LOCAL execution
> -
>
> Key: SPARK-6808
> URL: https://issues.apache.org/jira/browse/SPARK-6808
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.2.1, 1.3.0
> Environment: EC2 Ubuntu r3.8xlarge machines
>Reporter: Xinghao Pan
>
> I'm encountering a weird issue where a simple iterative zipPartitions is 
> PROCESS_LOCAL before checkpointing, but turns NODE_LOCAL for all iterations 
> after checkpointing. More often than not, tasks are fetching remote blocks 
> from the network, leading to a 10x increase in runtime.
> Here's an example snippet of code:
> var R : RDD[(Long,Int)] =
>   sc.parallelize(0 until numPartitions, numPartitions)
> .mapPartitions(_ => new Array[(Long,Int)](1000).map(i => (0L,0)).toSeq.iterator)
> .cache()
> sc.setCheckpointDir(checkpointDir)
> var iteration = 0
> while (iteration < 50){
>   R = R.zipPartitions(R)((x,y) => x).cache()
>   if ((iteration+1) % checkpointIter == 0) R.checkpoint()
>   R.foreachPartition(_ => {})
>   iteration += 1
> }
> I've also tried to unpersist the old RDDs and increased spark.locality.wait, 
> but neither helps.
> Strangely, adding a simple identity map
> R = R.map(x => x).cache()
> after the zipPartitions appears to partially mitigate the issue.
> The problem was originally triggered when I attempted to checkpoint after 
> doing joinVertices in GraphX, but the above example shows that the issue is 
> in Spark core too.
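
For reference, a self-contained version of the snippet above (an editorial 
sketch: the partition count, checkpoint interval, and checkpoint directory 
are assumed values, not taken from the report):

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("zip-checkpoint-locality"))
val numPartitions = 16   // assumed
val checkpointIter = 10  // assumed
sc.setCheckpointDir("/tmp/checkpoints") // assumed

var R: RDD[(Long, Int)] = sc
  .parallelize(0 until numPartitions, numPartitions)
  .mapPartitions(_ => Array.fill(1000)((0L, 0)).iterator)
  .cache()

var iteration = 0
while (iteration < 50) {
  R = R.zipPartitions(R)((x, _) => x).cache()
  // Reported mitigation: an identity map here partially restores
  // PROCESS_LOCAL scheduling.
  // R = R.map(x => x).cache()
  if ((iteration + 1) % checkpointIter == 0) R.checkpoint()
  R.foreachPartition(_ => {}) // force evaluation of this iteration
  iteration += 1
}
{code}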



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10945) GraphX computes Pagerank with NaN (with some datasets)

2016-02-22 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15156639#comment-15156639
 ] 

Robin East commented on SPARK-10945:


It's not obvious how to reproduce this from the datasets available at the 
download site. You mentioned that 'dataset format was converted to edge-list, 
no edge weights at all'. Can you share the code that converts from the 
WebGraph format to an edge list? Alternatively, can you make the input file 
available?

> GraphX computes Pagerank with NaN (with some datasets)
> --
>
> Key: SPARK-10945
> URL: https://issues.apache.org/jira/browse/SPARK-10945
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.3.0
> Environment: Linux
>Reporter: Khaled Ammar
>  Labels: test
>
> Hi,
> I run GraphX on a medium-size standalone Spark 1.3.0 installation. PageRank 
> typically works fine, except with one dataset (Twitter: 
> http://law.di.unimi.it/webdata/twitter-2010). This is a public dataset that 
> is commonly used in research papers.
> I found that many vertices have NaN values. This is true even if the 
> algorithm runs for only one iteration.
> Thanks,
> -Khaled



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin B closed SPARK-17869.
---
Resolution: Won't Fix

You are right, [~srowen].

> Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)
> -
>
> Key: SPARK-17869
> URL: https://issues.apache.org/jira/browse/SPARK-17869
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0, 2.0.1
> Environment: Mac OS X / Ubuntu
> pyspark
> hadoop-aws:2.7.3
> aws-java-sdk:1.11.41
>Reporter: Robin B
>
> Connection fails with a **400 Bad Request** for S3 in the Frankfurt region, 
> where version 4 authentication is required. 
> This issue is somewhat related to HADOOP-13325, but the solution there (to 
> include the endpoint explicitly) does nothing to ameliorate the problem.
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')
> 
> sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')
> 
> sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
> 
> sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')
> df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")
> yields:
>   16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
> directory.
>   Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
>  line 363, in csv
>   return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
>   return f(*a, **kw)
> File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
>   py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
>   : java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
>   at 
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at 
> org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
>   at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.L
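
Two details stand out in the configuration quoted above: the stack trace goes 
through the Jets3t `s3n` code path because `fs.s3a.impl` was pointed at 
`NativeS3FileSystem` (the s3n implementation, which only supports V2 
signing), and `com.amazonaws.services.s3.enableV4` is a JVM system property 
rather than a Hadoop configuration key, so setting it via 
`hadoopConfiguration()` has no effect. A sketch of a V4-capable setup 
(editorial, with placeholder values; it assumes hadoop-aws and a matching 
AWS SDK on the classpath):

{code:scala}
// The V4 switch must reach the driver and executor JVMs before any S3
// client is created, e.g. via:
//   --conf spark.driver.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
//   --conf spark.executor.extraJavaOptions=-Dcom.amazonaws.services.s3.enableV4=true
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3a-v4").getOrCreate()
val hc = spark.sparkContext.hadoopConfiguration

// Leave fs.s3a.impl at its default (org.apache.hadoop.fs.s3a.S3AFileSystem)
// and point s3a at the region-specific endpoint.
hc.set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
hc.set("fs.s3a.access.key", "ACCESS_KEY") // placeholder
hc.set("fs.s3a.secret.key", "SECRET_KEY") // placeholder

val df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")
{code}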

[jira] [Created] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)
Robin B created SPARK-17869:
---

 Summary: Connect to Amazon S3 using signature version 4 (only 
choice in Frankfurt)
 Key: SPARK-17869
 URL: https://issues.apache.org/jira/browse/SPARK-17869
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.1, 2.0.0
 Environment: Mac OS X / Ubuntu
pyspark
hadoop-aws:2.7.3
aws-java-sdk:1.11.41
Reporter: Robin B


Connection fails with a **400 Bad Request** for S3 in the Frankfurt region, 
where version 4 authentication is required. 

This issue is somewhat related to 
[HADOOP-13325|https://issues.apache.org/jira/browse/HADOOP-13325], but the 
solution there (to include the endpoint explicitly) does nothing to 
ameliorate the problem.


sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')

sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')

sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')

sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')

df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")

yields:

16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
directory.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
 line 363, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at 
org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.Delega

[jira] [Updated] (SPARK-17869) Connect to Amazon S3 using signature version 4 (only choice in Frankfurt)

2016-10-11 Thread Robin B (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin B updated SPARK-17869:

Description: 
Connection fails with a **400 Bad Request** for S3 in the Frankfurt region, 
where version 4 authentication is required. 

This issue is somewhat related to HADOOP-13325, but the solution there (to 
include the endpoint explicitly) does nothing to ameliorate the problem.


sc._jsc.hadoopConfiguration().set('fs.s3a.impl','org.apache.hadoop.fs.s3native.NativeS3FileSystem')

sc._jsc.hadoopConfiguration().set('com.amazonaws.services.s3.enableV4','true')

sc.setSystemProperty('SDKGlobalConfiguration.ENABLE_S3_SIGV4_SYSTEM_PROPERTY','true')

sc._jsc.hadoopConfiguration().set('fs.s3a.endpoint','s3.eu-central-1.amazonaws.com')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsAccessKeyId','ACCESS_KEY')
sc._jsc.hadoopConfiguration().set('fs.s3a.awsSecretAccessKey','SECRET_KEY')

df = spark.read.csv("s3a://BUCKET-NAME/filename.csv")

yields:

16/10/10 18:39:28 WARN DataSource: Error while looking for metadata 
directory.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/readwriter.py",
 line 363, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
 line 933, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
 line 312, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.csv.
: java.io.IOException: s3n://BUCKET-NAME : 400 : Bad Request
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:453)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
at 
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at 
org.apache.hadoop.fs.s3native.$Proxy7.retrieveMetadata(Unknown Source)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:476)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:360)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:350)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
at 
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:149)
at 
org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:401)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:

[jira] [Commented] (SPARK-15328) Word2Vec import for original binary format

2016-12-05 Thread Robin East (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723536#comment-15723536
 ] 

Robin East commented on SPARK-15328:


Any news on the PR for this? There seem to be a few issues with large-scale 
models, as I mention in the comments.

> Word2Vec import for original binary format
> --
>
> Key: SPARK-15328
> URL: https://issues.apache.org/jira/browse/SPARK-15328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29861) Reduce downtime in Spark standalone HA master switch

2019-11-12 Thread Robin Wolters (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robin Wolters updated SPARK-29861:
--
Summary: Reduce downtime in Spark standalone HA master switch  (was: Reduce 
leader election downtime in Spark standalone HA)

> Reduce downtime in Spark standalone HA master switch
> 
>
> Key: SPARK-29861
> URL: https://issues.apache.org/jira/browse/SPARK-29861
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Robin Wolters
>Priority: Minor
>
> As officially stated in the Spark [HA 
> documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper],
>  the recovery process of the Spark (standalone) master in HA with ZooKeeper 
> takes about 1-2 minutes. During this time no Spark master is active, which 
> makes interaction with Spark essentially impossible. 
> After looking for a way to reduce this downtime, it seems that it is mainly 
> caused by the leader election, which waits for open ZooKeeper connections to 
> be closed. This seems like unnecessary downtime, for example in the case of 
> a planned VM update.
> I have fixed this in my setup by:
>  # Closing open ZooKeeper connections during Spark shutdown
>  # Bumping the Curator version and implementing a custom error policy that 
> is tolerant to a ZooKeeper connection suspension.
> I am preparing a pull request for review / further discussion on this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29861) Reduce leader election downtime in Spark standalone HA

2019-11-12 Thread Robin Wolters (Jira)
Robin Wolters created SPARK-29861:
-

 Summary: Reduce leader election downtime in Spark standalone HA
 Key: SPARK-29861
 URL: https://issues.apache.org/jira/browse/SPARK-29861
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Robin Wolters


As officially stated in the Spark [HA 
documentation|https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper],
 the recovery process of the Spark (standalone) master in HA with ZooKeeper 
takes about 1-2 minutes. During this time no Spark master is active, which 
makes interaction with Spark essentially impossible. 

After looking for a way to reduce this downtime, it seems that it is mainly 
caused by the leader election, which waits for open ZooKeeper connections to 
be closed. This seems like unnecessary downtime, for example in the case of a 
planned VM update.

I have fixed this in my setup by:
 # Closing open ZooKeeper connections during Spark shutdown
 # Bumping the Curator version and implementing a custom error policy that is 
tolerant to a ZooKeeper connection suspension (sketched below).

I am preparing a pull request for review / further discussion on this issue.
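
A sketch of the second part of that fix (editorial; it assumes Curator 3.x+ 
APIs, in line with "bumping the curator version", and a placeholder connect 
string): build the client with an error policy that treats only a LOST 
session as an error, so a transient SUSPENDED connection does not trigger a 
leader re-election.

{code:scala}
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.state.{ConnectionState, ConnectionStateErrorPolicy}
import org.apache.curator.retry.ExponentialBackoffRetry

// Tolerate SUSPENDED; only a LOST session counts as an error state.
val suspensionTolerantPolicy = new ConnectionStateErrorPolicy {
  override def isErrorState(state: ConnectionState): Boolean =
    state == ConnectionState.LOST
}

val client = CuratorFrameworkFactory.builder()
  .connectString("zk-host:2181") // placeholder
  .retryPolicy(new ExponentialBackoffRetry(1000, 3))
  .connectionStateErrorPolicy(suspensionTolerantPolicy)
  .build()
client.start()
{code}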

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org