[jira] [Created] (SPARK-37295) illegal reflective access operation has occurred; Please consider reporting this to the maintainers

2021-11-11 Thread Andrew Davidson (Jira)
Andrew Davidson created SPARK-37295:
---

 Summary: illegal reflective access operation has occurred; Please 
consider reporting this to the maintainers
 Key: SPARK-37295
 URL: https://issues.apache.org/jira/browse/SPARK-37295
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.2
 Environment: MacBook pro running mac OS 11.6

spark-3.1.2-bin-hadoop3.2

It is not clear to me how Spark finds Java. I believe I also have Java 8 
installed somewhere.

```

$ which java
~/anaconda3/envs/extraCellularRNA/bin/java

$ java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment (build 11.0.6+8-b765.1)
OpenJDK 64-Bit Server VM (build 11.0.6+8-b765.1, mixed mode)

```
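
For reference, this is roughly how I check which JVM pyspark will pick up (a sketch; 
as far as I can tell the launcher uses JAVA_HOME when it is set and otherwise falls 
back to the first java on PATH):

```
import os, shutil, subprocess

# Rough check of the JVM pyspark's launcher should pick up. As far as I can tell,
# JAVA_HOME wins when it is set; otherwise the java found on PATH is used.
print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
print("java on PATH =", shutil.which("java"))
subprocess.run(["java", "-version"])
```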

 
Reporter: Andrew Davidson


```

   spark = SparkSession\
                .builder\
                .appName("TestEstimatedScalingFactors")\
                .getOrCreate()

```

generates the following warning

```

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/Users/xxx/googleUCSC/kimLab/extraCellularRNA/terra/deseq/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar)
 to constructor java.nio.DirectByteBuffer(long,int)

WARNING: Please consider reporting this to the maintainers of 
org.apache.spark.unsafe.Platform

WARNING: Use --illegal-access=warn to enable warnings of further illegal 
reflective access operations

WARNING: All illegal access operations will be denied in a future release

21/11/11 12:51:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable

```
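
If it helps with debugging, I think the JVM's own suggestion can be used to surface 
more detail; an untested sketch that passes --illegal-access=warn to the driver JVM 
through PYSPARK_SUBMIT_ARGS before the gateway starts:

```
import os

# Untested sketch: the driver JVM options have to be in place before the JVM is
# launched, so set PYSPARK_SUBMIT_ARGS before building the SparkSession.
# The trailing "pyspark-shell" token is required by pyspark's launcher.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    '--driver-java-options "--illegal-access=warn" pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession\
            .builder\
            .appName("TestEstimatedScalingFactors")\
            .getOrCreate()
```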

I am using pyspark spark-3.1.2-bin-hadoop3.2 on a MacBook pro running mac OS 
11.6

 

My small unit test seems to work okay; however, it fails when I try to run on 
3.2.0.

Any idea how I can track down this issue? Kind regards

 

Andy

 






[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428517#comment-16428517
 ] 

Andrew Davidson commented on SPARK-23878:
-

yes please mark resolved

 

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a discription of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  






[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428509#comment-16428509
 ] 

Andrew Davidson commented on SPARK-23878:
-

added screen shot to  https://www.pinterest.com/pin/842454674019358931/

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a discription of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  






[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428494#comment-16428494
 ] 

Andrew Davidson commented on SPARK-23878:
-

Hi Hyukjin

many thanks! I am not a Python expert and did not know about dynamic namespaces. 
After a little googling, this is how to configure it in Eclipse:

Preferences -> PyDev -> Interpreter -> Python Interpreters

add pyspark to 'forced builtins'

I attached a screen shot. Any idea how this can be added to the documentation 
so others do not have to waste time on this detail?

p.s. you sent me a link to "Support virtualenv in PySpark" 
https://issues.apache.org/jira/browse/SPARK-13587

Once I made the PyDev configuration change I am able to use pyspark in a 
virtualenv. I use this environment in the Eclipse PyDev IDE without any problems, 
and I am able to run Jupyter notebooks without any problem.

 

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a discription of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  






[jira] [Updated] (SPARK-23878) unable to import col() or lit()

2018-04-06 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-23878:

Attachment: eclipsePyDevPySparkConfig.png

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
> Attachments: eclipsePyDevPySparkConfig.png
>
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a discription of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  






[jira] [Commented] (SPARK-23878) unable to import col() or lit()

2018-04-05 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16427883#comment-16427883
 ] 

Andrew Davidson commented on SPARK-23878:
-

Hi Hyukjin

you are correct. Most IDEs are primarily language-aware editors and builders. 
For example, consider Eclipse or IntelliJ for developing a JavaScript website or a 
Java servlet. The editor knows about the syntax of the language you are working 
with, along with the libraries and packages you are using. Often the IDE does 
some sort of continuous build or code analysis to help you find bugs without 
having to deploy.

Often the IDE also makes it easy to build, package, and deploy to some sort of 
test server, and to debug and/or run unit tests.

So if pyspark is generating functions at runtime, that is going to cause problems 
for the IDE: the functions are not defined in the edit session.

[http://www.learn4master.com/algorithms/pyspark-unit-test-set-up-sparkcontext] 
describes how to write unit tests for pyspark that you can run from the 
command line and/or from within Eclipse. I think a side effect is that they 
might cause the functions lit() and col() to be generated?

I could not find a workaround for col() and lit().

 

    ret = df.select(
                col(columnName).cast("string").alias("key"),
                lit(value).alias("source")
            )
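
For reference, a self-contained version of the same select (column name, literal 
value, and data below are placeholders). At runtime the imports resolve fine; it is 
only the editor's static analysis that flags them:

```
# Placeholders throughout; runs from the command line with pyspark 2.3 installed.
from pyspark import SparkContext
from pyspark.sql import SQLContext
# These imports work at runtime even though PyDev reports "Unresolved import",
# because col() and lit() are generated when pyspark.sql.functions is loaded.
from pyspark.sql.functions import col, lit

sc = SparkContext(appName="colLitSmokeTest")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "someColumn"])
ret = df.select(
    col("someColumn").cast("string").alias("key"),
    lit("someSource").alias("source"),
)
ret.show()
sc.stop()
```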

 

Kind regards

 

Andy  

> unable to import col() or lit()
> ---
>
> Key: SPARK-23878
> URL: https://issues.apache.org/jira/browse/SPARK-23878
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: eclipse 4.7.3
> pyDev 6.3.2
> pyspark==2.3.0
>Reporter: Andrew Davidson
>Priority: Major
>
> I have some code I am moving from a jupyter notebook to separate python 
> modules. My notebook uses col() and list() and works fine
> when I try to work with module files in my IDE I get the following errors. I 
> am also not able to run my unit tests.
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: lit load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem{color}
> {color:#FF}Description Resource Path Location Type{color}
> {color:#FF}Unresolved import: col load.py 
> /adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem{color}
> I suspect that when you run pyspark it is generating the col and lit 
> functions?
> I found a discription of the problem @ 
> [https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
>  I do not understand how to make this work in my IDE. I am not running 
> pyspark just an editor
> is there some sort of workaround or replacement for these missing functions?
>  
>  






[jira] [Created] (SPARK-23878) unable to import col() or lit()

2018-04-05 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-23878:
---

 Summary: unable to import col() or lit()
 Key: SPARK-23878
 URL: https://issues.apache.org/jira/browse/SPARK-23878
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
 Environment: eclipse 4.7.3

pyDev 6.3.2

pyspark==2.3.0
Reporter: Andrew Davidson


I have some code I am moving from a Jupyter notebook to separate Python 
modules. My notebook uses col() and lit() and works fine.

When I try to work with the module files in my IDE I get the following errors. I am 
also not able to run my unit tests.

Description Resource Path Location Type
Unresolved import: lit load.py 
/adt_pyDevProj/src/automatedDataTranslation line 22 PyDev Problem

Description Resource Path Location Type
Unresolved import: col load.py 
/adt_pyDevProj/src/automatedDataTranslation line 21 PyDev Problem

I suspect that when you run pyspark it generates the col and lit functions?

I found a description of the problem at 
[https://stackoverflow.com/questions/40163106/cannot-find-col-function-in-pyspark]
but I do not understand how to make this work in my IDE. I am not running pyspark, 
just an editor.

Is there some sort of workaround or replacement for these missing functions?

 

 






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431409#comment-15431409
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

I forgot about that older jira issue. I never resolved it. I am using Jupyter, and 
I believe each notebook gets its own spark context. I googled around and found 
some old issues that seem to suggest that both a hive and a sql context were being 
created. I have not figured out how to either use a different database for the 
hive context or prevent the original spark context from being created.



> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431371#comment-15431371
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

the data center was created using spark-ec2 from spark-1.6.1-bin-hadoop2.6

ec2-user@ip-172-31-22-140 root]$ cat /root/spark/RELEASE 
Spark 1.6.1 built for Hadoop 2.0.0-mr1-cdh4.2.0
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver 
-Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DzincPort=3032
[ec2-user@ip-172-31-22-140 root]$ 

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431018#comment-15431018
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

It should be very easy to use the attached notebook to reproduce the hive bug. 
I got the code example from a blog; the original code worked in spark 1.5.x.

I also attached an html version of the notebook so you can see the entire stack 
trace without having to start Jupyter.

thanks

Andy

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-21 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430004#comment-15430004
 ] 

Andrew Davidson commented on SPARK-17172:
-

Hi Sean

I do not think it is the same error. In the related bug, I could not create 
a udf using sqlContext. The workaround was to change the permissions 
on hdfs:///tmp. The error msg actually mentioned a problem with /tmp (I thought 
the msg referred to file:///tmp). Not sure how the permissions got messed up; 
maybe someone deleted it by accident and spark does not recreate it if it is 
missing?

So I am able to create a udf using sqlContext; hiveContext does not work. Given I 
fixed the hdfs:/// permission problem I think it is probably something else. 
Hopefully the attached notebook makes it easy to reproduce.

thanks

Andy

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Updated] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-20 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-17172:

Attachment: hiveUDFBug.ipynb
hiveUDFBug.html

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-20 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429465#comment-15429465
 ] 

Andrew Davidson commented on SPARK-17172:
-

attached a notebook that demonstrates the bug. Also attached an html version of 
the notebook.

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
> Attachments: hiveUDFBug.html, hiveUDFBug.ipynb
>
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Commented] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-20 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429463#comment-15429463
 ] 

Andrew Davidson commented on SPARK-17172:
-

related bug report : https://issues.apache.org/jira/browse/SPARK-17143

> pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while 
> calling None.org.apache.spark.sql.hive.HiveContext. 
> --
>
> Key: SPARK-17172
> URL: https://issues.apache.org/jira/browse/SPARK-17172
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.2
> Environment: spark version: 1.6.2
> python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
> [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
>Reporter: Andrew Davidson
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
> # Define udf
> from pyspark.sql.functions import udf
> def scoreToCategory(score):
> if score >= 80: return 'A'
> elif score >= 60: return 'B'
> elif score >= 35: return 'C'
> else: return 'D'
>  
> udfScoreToCategory=udf(scoreToCategory, StringType())
> throws exception
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.lang.RuntimeException: Unable to 
> instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient






[jira] [Created] (SPARK-17172) pyspak hiveContext can not create UDF: Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.

2016-08-20 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-17172:
---

 Summary: pyspak hiveContext can not create UDF: Py4JJavaError: An 
error occurred while calling None.org.apache.spark.sql.hive.HiveContext. 
 Key: SPARK-17172
 URL: https://issues.apache.org/jira/browse/SPARK-17172
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.2
 Environment: spark version: 1.6.2
python version: 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
Reporter: Andrew Davidson


from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

# Define udf
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType  # needed for the return type below

def scoreToCategory(score):
    if score >= 80: return 'A'
    elif score >= 60: return 'B'
    elif score >= 35: return 'C'
    else: return 'D'

udfScoreToCategory = udf(scoreToCategory, StringType())

throws exception

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
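
For comparison, the same udf defined against a plain SQLContext does work for me, 
which is why I think the problem is specific to HiveContext. A sketch:

```
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

def scoreToCategory(score):
    if score >= 80: return 'A'
    elif score >= 60: return 'B'
    elif score >= 35: return 'C'
    else: return 'D'

# Plain SQLContext instead of HiveContext: this path works, which points at
# HiveContext / the Hive metastore rather than udf creation itself.
sqlContext = SQLContext(sc)
udfScoreToCategory = udf(scoreToCategory, StringType())
```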






[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427394#comment-15427394
 ] 

Andrew Davidson commented on SPARK-17143:
-

See the email from the user's group below. I was able to find a workaround. Not 
sure how hdfs:///tmp got created or how the permissions got messed up.

##

NICE CATCH!!! Many thanks. 

I spent all day on this bug

The error msg reports /tmp. I did not think to look on hdfs.

[ec2-user@ip-172-31-22-140 notebooks]$ hadoop fs -ls hdfs:///tmp/
Found 1 items
-rw-r--r--   3 ec2-user supergroup418 2016-04-13 22:49 hdfs:///tmp
[ec2-user@ip-172-31-22-140 notebooks]$ 


I have no idea how hdfs:///tmp got created. I deleted it. 

This causes a bunch of exceptions, but the exceptions have useful messages. I was 
able to fix the problem as follows:

$ hadoop fs -rmr hdfs:///tmp

Now I run the notebook. It creates hdfs:///tmp/hive but the permissions are wrong:

$ hadoop fs -chmod 777 hdfs:///tmp/hive
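
A quick way to confirm the fix (sketch; the data is made up):

```
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Made-up data; just confirms that defining and applying a UDF no longer
# throws the HiveContext error after the hdfs:///tmp cleanup above.
toLowerUDF = udf(lambda s: s.lower(), StringType())
df = sqlContext.createDataFrame([("AAA",), ("Bbb",)], ["txt"])
df.select(toLowerUDF(df["txt"]).alias("lower_txt")).show()
```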


From: Felix Cheung 
Date: Thursday, August 18, 2016 at 3:37 PM
To: Andrew Davidson , "user @spark" 

Subject: Re: pyspark unable to create UDF: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp

Do you have a file called tmp at / on HDFS?




> pyspark unable to create UDF: java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> ---
>
> Key: SPARK-17143
> URL: https://issues.apache.org/jira/browse/SPARK-17143
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>Reporter: Andrew Davidson
> Attachments: udfBug.html, udfBug.ipynb
>
>
> For unknown reason I can not create UDF when I run the attached notebook on 
> my cluster. I get the following error
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> The notebook runs fine on my Mac
> In general I am able to run non UDF spark code with out any trouble
> I start the notebook server as the user “ec2-user" and uses master URL 
>   spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066
> I found the following message in the notebook server log file. I have log 
> level set to warn
> 16/08/18 21:38:45 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> The cluster was originally created using 
> spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> #from pyspark.sql import SQLContext, HiveContext
> #sqlContext = SQLContext(sc)
> ​
> #from pyspark.sql import DataFrame
> #from pyspark.sql import functions
> ​
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
> ​
> print("spark version: {}".format(sc.version))
> ​
> import sys
> print("python version: {}".format(sys.version))
> spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
> # functions.lower() raises 
> # py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> # work around define a UDF
> toLowerUDFRetType = StringType()
> #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> toLowerUDF = udf(lambda s : s.lower(), StringType())
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
>   4 toLowerUDFRetType = StringType()
>   5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> > 6 toLowerUDF = udf(lambda s : s.lower(), StringType())
> /root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /root/spark/python/pyspark/sql/functions.py in __init__(self, func, 
> returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559
>1560 def _create_judf(self, name):
> /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
> 

[jira] [Commented] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15427278#comment-15427278
 ] 

Andrew Davidson commented on SPARK-17143:
-

given the exception mentioned an issue with /tmp, I decided to track how /tmp 
changed when I ran my cell

# no spark jobs are running
[ec2-user@ip-172-31-22-140 notebooks]$ !ls
ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# start notebook server
$ nohup startIPythonNotebook.sh > startIPythonNotebook.sh.out &

[ec2-user@ip-172-31-22-140 notebooks]$ !ls
ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# start the udfBug notebook
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  
libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user
[ec2-user@ip-172-31-22-140 notebooks]$ 

# execute cell that define UDF
[ec2-user@ip-172-31-22-140 notebooks]$ ls /tmp/
hsperfdata_ec2-user  hsperfdata_root  
libnetty-transport-native-epoll818283657820702.so  pip_build_ec2-user  
spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9
[ec2-user@ip-172-31-22-140 notebooks]$ 

[ec2-user@ip-172-31-22-140 notebooks]$ find 
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/db.lck
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/log1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/log/logmirror.ctrl
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/service.properties
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/README_DO_NOT_TOUCH_FILES.txt
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c230.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c4b0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c241.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c180.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c2b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c311.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c880.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c541.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c9f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c20.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c590.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c721.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c470.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c441.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c8e1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c361.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c421.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c331.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c461.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c5d0.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c851.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c621.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c101.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c3d1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c891.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1b1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c641.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c871.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c6a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/cb1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/ca01.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c391.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c7f1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c1a1.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c41.dat
/tmp/spark-15afb30e-b1ed-4fe9-9d09-bad63e2070a9/metastore/seg0/c990.dat

[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-17143:

Attachment: udfBug.html

This html version of the notebook shows the output when run in my data center

> pyspark unable to create UDF: java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> ---
>
> Key: SPARK-17143
> URL: https://issues.apache.org/jira/browse/SPARK-17143
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>Reporter: Andrew Davidson
> Attachments: udfBug.html, udfBug.ipynb
>
>
> For unknown reason I can not create UDF when I run the attached notebook on 
> my cluster. I get the following error
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> The notebook runs fine on my Mac
> In general I am able to run non UDF spark code with out any trouble
> I start the notebook server as the user “ec2-user" and uses master URL 
>   spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066
> I found the following message in the notebook server log file. I have log 
> level set to warn
> 16/08/18 21:38:45 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> The cluster was originally created using 
> spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> #from pyspark.sql import SQLContext, HiveContext
> #sqlContext = SQLContext(sc)
> ​
> #from pyspark.sql import DataFrame
> #from pyspark.sql import functions
> ​
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
> ​
> print("spark version: {}".format(sc.version))
> ​
> import sys
> print("python version: {}".format(sys.version))
> spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
> # functions.lower() raises 
> # py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> # work around define a UDF
> toLowerUDFRetType = StringType()
> #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> toLowerUDF = udf(lambda s : s.lower(), StringType())
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
>   4 toLowerUDFRetType = StringType()
>   5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> > 6 toLowerUDF = udf(lambda s : s.lower(), StringType())
> /root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /root/spark/python/pyspark/sql/functions.py in __init__(self, func, 
> returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559
>1560 def _create_judf(self, name):
> /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
>1567 pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command, self)
>1568 ctx = SQLContext.getOrCreate(sc)
> -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>1570 if name is None:
>1571 name = f.__name__ if hasattr(f, '__name__') else 
> f.__class__.__name__
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 681 try:
> 682 if not hasattr(self, '_scala_HiveContext'):
> --> 683 self._scala_HiveContext = self._get_hive_ctx()
> 684 return self._scala_HiveContext
> 685 except Py4JError as e:
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 690
> 691 def _get_hive_ctx(self):
> --> 692 return self._jvm.HiveContext(self._jsc.sc())
> 693
> 694 def refreshTable(self, tableName):
> /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1062 answer = self._gateway_client.send_command(command)
>1063 return_value = get_return_value(
> -> 

[jira] [Updated] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-17143:

Attachment: udfBug.ipynb

The attached notebook demonstrates the reported bug. Note it includes the 
output when run on my MacBook Pro. The bug report contains the stack trace 
when the same code is run in my data center.

> pyspark unable to create UDF: java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> ---
>
> Key: SPARK-17143
> URL: https://issues.apache.org/jira/browse/SPARK-17143
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40) 
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
>Reporter: Andrew Davidson
> Attachments: udfBug.ipynb
>
>
> For unknown reason I can not create UDF when I run the attached notebook on 
> my cluster. I get the following error
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: 
> org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
> directory: /tmp tmp
> The notebook runs fine on my Mac
> In general I am able to run non UDF spark code with out any trouble
> I start the notebook server as the user “ec2-user" and uses master URL 
>   spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066
> I found the following message in the notebook server log file. I have log 
> level set to warn
> 16/08/18 21:38:45 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> The cluster was originally created using 
> spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> #from pyspark.sql import SQLContext, HiveContext
> #sqlContext = SQLContext(sc)
> ​
> #from pyspark.sql import DataFrame
> #from pyspark.sql import functions
> ​
> from pyspark.sql.types import StringType
> from pyspark.sql.functions import udf
> ​
> print("spark version: {}".format(sc.version))
> ​
> import sys
> print("python version: {}".format(sys.version))
> spark version: 1.6.1
> python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
> [GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
> # functions.lower() raises 
> # py4j.Py4JException: Method lower([class java.lang.String]) does not exist
> # work around define a UDF
> toLowerUDFRetType = StringType()
> #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> toLowerUDF = udf(lambda s : s.lower(), StringType())
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> Py4JJavaErrorTraceback (most recent call last)
>  in ()
>   4 toLowerUDFRetType = StringType()
>   5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> > 6 toLowerUDF = udf(lambda s : s.lower(), StringType())
> /root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /root/spark/python/pyspark/sql/functions.py in __init__(self, func, 
> returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559
>1560 def _create_judf(self, name):
> /root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
>1567 pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command, self)
>1568 ctx = SQLContext.getOrCreate(sc)
> -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>1570 if name is None:
>1571 name = f.__name__ if hasattr(f, '__name__') else 
> f.__class__.__name__
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 681 try:
> 682 if not hasattr(self, '_scala_HiveContext'):
> --> 683 self._scala_HiveContext = self._get_hive_ctx()
> 684 return self._scala_HiveContext
> 685 except Py4JError as e:
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 690
> 691 def _get_hive_ctx(self):
> --> 692 return self._jvm.HiveContext(self._jsc.sc())
> 693
> 694 def refreshTable(self, tableName):
> /root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
>1062 

[jira] [Created] (SPARK-17143) pyspark unable to create UDF: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a directory: /tmp tmp

2016-08-18 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-17143:
---

 Summary: pyspark unable to create UDF: java.lang.RuntimeException: 
org.apache.hadoop.fs.FileAlreadyExistsException: Parent path is not a 
directory: /tmp tmp
 Key: SPARK-17143
 URL: https://issues.apache.org/jira/browse/SPARK-17143
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
 Environment: spark version: 1.6.1
python version: 3.4.3 (default, Apr  1 2015, 18:10:40) 
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]
Reporter: Andrew Davidson


For an unknown reason I can not create a UDF when I run the attached notebook on my 
cluster. I get the following error

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: org.apache.hadoop.fs.FileAlreadyExistsException: 
Parent path is not a directory: /tmp tmp

The notebook runs fine on my Mac.

In general I am able to run non-UDF spark code without any trouble.

I start the notebook server as the user "ec2-user" and use the master URL 
spark://ec2-51-215-120-63.us-west-1.compute.amazonaws.com:6066


I found the following message in the notebook server log file. I have log level 
set to warn

16/08/18 21:38:45 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
16/08/18 21:38:45 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException


The cluster was originally created using spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2



#from pyspark.sql import SQLContext, HiveContext
#sqlContext = SQLContext(sc)

#from pyspark.sql import DataFrame
#from pyspark.sql import functions

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

print("spark version: {}".format(sc.version))

import sys
print("python version: {}".format(sys.version))

spark version: 1.6.1
python version: 3.4.3 (default, Apr  1 2015, 18:10:40)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)]



# functions.lower() raises 
# py4j.Py4JException: Method lower([class java.lang.String]) does not exist
# work around define a UDF
toLowerUDFRetType = StringType()
#toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
toLowerUDF = udf(lambda s : s.lower(), StringType())
You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
assembly
Py4JJavaErrorTraceback (most recent call last)
 in ()
  4 toLowerUDFRetType = StringType()
  5 #toLowerUDF = udf(lambda s : s.lower(), toLowerUDFRetType)
> 6 toLowerUDF = udf(lambda s : s.lower(), StringType())

/root/spark/python/pyspark/sql/functions.py in udf(f, returnType)
   1595 [Row(slen=5), Row(slen=3)]
   1596 """
-> 1597 return UserDefinedFunction(f, returnType)
   1598
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/root/spark/python/pyspark/sql/functions.py in __init__(self, func, returnType, 
name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559
   1560 def _create_judf(self, name):

/root/spark/python/pyspark/sql/functions.py in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else 
f.__class__.__name__

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
690
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693
694 def refreshTable(self, tableName):

/root/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, 
*args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065
   1066 for temp_arg in temp_args:

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/root/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred 

[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode

2016-07-01 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359277#comment-15359277
 ] 

Andrew Davidson commented on SPARK-15829:
-

Hi Sean

you mention the ec2 script is not supported anymore? What was the last release 
it was supported in? It is still part of the 1.6.x documentation.

Is there a replacement or alternative?

thanks

Andy

> spark master webpage links to application UI broke when running in cluster 
> mode
> ---
>
> Key: SPARK-15829
> URL: https://issues.apache.org/jira/browse/SPARK-15829
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: AWS ec2 cluster
>Reporter: Andrew Davidson
>Priority: Critical
>
> Hi 
> I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> I use the stand alone cluster manager. I have a streaming app running in 
> cluster mode. I notice the master webpage links to the application UI page 
> are incorrect
> It does not look like jira will let me upload images. I'll try and describe 
> the web pages and the bug
> My master is running on
> http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/
> It has a section marked "applications". If I click on one of the running 
> application ids I am taken to a page showing "Executor Summary".  This page 
> has a link to the 'application detail UI'; the url is 
> http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/
> Notice it thinks the application UI is running on the cluster master.
> It is actually running on the same machine as the driver on port 4041. I was 
> able to reverse engineer the url by noticing the private ip address is part of 
> the worker id. For example worker-20160322041632-172.31.23.201-34909
> next I went on the aws ec2 console to find the public DNS name for this 
> machine 
> http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/
> Kind regards
> Andy






[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode

2016-07-01 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15359215#comment-15359215
 ] 

Andrew Davidson commented on SPARK-15829:
-

Hi Sean

I am not sure how to check the value of 'SPARK_MASTER_HOST'. I looked at the 
documentation pages http://spark.apache.org/docs/latest/configuration.html and  
http://spark.apache.org/docs/latest/ec2-scripts.html. They do not mention 
SPARK_MASTER_HOST

when I submit my jobs I use 
MASTER_URL=spark://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:6066

I use the stand alone cluster manager

I think the problem may be that the web UI assumes the driver is always running 
on the master machine. I assume the cluster manager decides which worker the 
driver will run on. Is there a way for the web UI to discover where the driver 
is running?
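
One way to recover the correct link from inside the application itself (a minimal 
Java sketch, assuming the standalone manager has populated spark.driver.host; on 
EC2 this is the private hostname, so it still has to be mapped to the public DNS 
name as described above):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class DriverUiAddress {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("whereIsMyUI"));
        SparkConf conf = sc.getConf();
        // spark.driver.host is filled in at runtime by the cluster manager;
        // spark.ui.port is only present if set explicitly, so fall back to 4040
        // (the UI may actually be on 4041/4042/... if earlier ports were taken).
        String host = conf.get("spark.driver.host", "unknown");
        String port = conf.get("spark.ui.port", "4040");
        System.out.println("application detail UI should be at http://" + host + ":" + port + "/");
        sc.stop();
    }
}
```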

On my master 
[ec2-user@ip-172-31-22-140 conf]$ pwd
/root/spark/conf
[ec2-user@ip-172-31-22-140 conf]$ cat slaves
ec2-54-193-94-207.us-west-1.compute.amazonaws.com
ec2-54-67-13-246.us-west-1.compute.amazonaws.com
ec2-54-67-48-49.us-west-1.compute.amazonaws.com
ec2-54-193-104-169.us-west-1.compute.amazonaws.com
[ec2-user@ip-172-31-22-140 conf]$ 


[ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 conf]$ pwd
/root/spark/conf
[ec2-user@ip-172-31-22-140 conf]$ grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 conf]$ 

ec2-user@ip-172-31-22-140 sbin]$ pwd
/root/spark/sbin
[ec2-user@ip-172-31-22-140 sbin]$ grep SPARK_MASTER_HOST *

[ec2-user@ip-172-31-22-140 bin]$ !grep
grep SPARK_MASTER_HOST *
[ec2-user@ip-172-31-22-140 bin]$ 

Thanks for looking into this

Andy


> spark master webpage links to application UI broke when running in cluster 
> mode
> ---
>
> Key: SPARK-15829
> URL: https://issues.apache.org/jira/browse/SPARK-15829
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: AWS ec2 cluster
>Reporter: Andrew Davidson
>Priority: Critical
>
> Hi 
> I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> I use the stand alone cluster manager. I have a streaming app running in 
> cluster mode. I notice the master webpage links to the application UI page 
> are incorrect
> It does not look like jira will let my upload images. I'll try and describe 
> the web pages and the bug
> My master is running on
> http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/
> It has a section marked "applications". If I click on one of the running 
> application ids I am taken to a page showing "Executor Summary".  This page 
> has a link to teh 'application detail UI'  the url is 
> http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/
> Notice it things the application UI is running on the cluster master.
> It is actually running on the same machine as the driver on port 4041. I was 
> able to reverse engine the url by noticing the private ip address is part of 
> the worker id . For example worker-20160322041632-172.31.23.201-34909
> next I went on the aws ec2 console to find the public DNS name for this 
> machine 
> http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/
> Kind regards
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode

2016-06-10 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15325001#comment-15325001
 ] 

Andrew Davidson commented on SPARK-15829:
-

Hi Xin

I ran netstat on my master. I do not think the ports are in use. 

To submit in cluster mode I use port 6066. If you are using port 7077 you are 
in client mode. In client mode the application UI will run on the spark master. 
In cluster mode the application UI runs on whichever slave the driver is 
running on. If you notice in my original description, the url is incorrect: the 
ip is wrong, the port is correct.

Kind regards

Andy

bash-4.2# netstat -tulpn 
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address   Foreign Address 
State   PID/Program name   
tcp0  0 0.0.0.0:86520.0.0.0:*   
LISTEN  3832/gmetad 
tcp0  0 0.0.0.0:87870.0.0.0:*   
LISTEN  2584/rserver
tcp0  0 0.0.0.0:36757   0.0.0.0:*   
LISTEN  2905/java   
tcp0  0 0.0.0.0:50070   0.0.0.0:*   
LISTEN  2905/java   
tcp0  0 0.0.0.0:22  0.0.0.0:*   
LISTEN  2144/sshd   
tcp0  0 127.0.0.1:631   0.0.0.0:*   
LISTEN  2095/cupsd  
tcp0  0 127.0.0.1:7000  0.0.0.0:*   
LISTEN  6512/python3.4  
tcp0  0 127.0.0.1:250.0.0.0:*   
LISTEN  2183/sendmail   
tcp0  0 0.0.0.0:43813   0.0.0.0:*   
LISTEN  3093/java   
tcp0  0 172.31.22.140:9000  0.0.0.0:*   
LISTEN  2905/java   
tcp0  0 0.0.0.0:86490.0.0.0:*   
LISTEN  3810/gmond  
tcp0  0 0.0.0.0:50090   0.0.0.0:*   
LISTEN  3093/java   
tcp0  0 0.0.0.0:86510.0.0.0:*   
LISTEN  3832/gmetad 
tcp0  0 :::8080 :::*
LISTEN  23719/java  
tcp0  0 :::8081 :::*
LISTEN  5588/java   
tcp0  0 :::172.31.22.140:6066   :::*
LISTEN  23719/java  
tcp0  0 :::172.31.22.140:6067   :::*
LISTEN  5588/java   
tcp0  0 :::22   :::*
LISTEN  2144/sshd   
tcp0  0 ::1:631 :::*
LISTEN  2095/cupsd  
tcp0  0 :::19998:::*
LISTEN  3709/java   
tcp0  0 :::1:::*
LISTEN  3709/java   
tcp0  0 :::172.31.22.140:7077   :::*
LISTEN  23719/java  
tcp0  0 :::172.31.22.140:7078   :::*
LISTEN  5588/java   
udp0  0 0.0.0.0:86490.0.0.0:*   
3810/gmond  
udp0  0 0.0.0.0:631 0.0.0.0:*   
2095/cupsd  
udp0  0 0.0.0.0:38546   0.0.0.0:*   
2905/java   
udp0  0 0.0.0.0:68  0.0.0.0:*   
1142/dhclient   
udp0  0 172.31.22.140:123   0.0.0.0:*   
2168/ntpd   
udp0  0 127.0.0.1:123   0.0.0.0:*   
2168/ntpd   
udp0  0 0.0.0.0:123 0.0.0.0:*   
2168/ntpd   
bash-4.2# 


> spark master webpage links to application UI broke when running in cluster 
> mode
> ---
>
> Key: SPARK-15829
> URL: https://issues.apache.org/jira/browse/SPARK-15829
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.6.1
> Environment: AWS ec2 cluster
>Reporter: Andrew Davidson
>Priority: Critical
>
> Hi 
> I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2
> I use the stand alone cluster manager. I have a streaming app running in 
> cluster mode. I notice the master webpage links to the application UI page 
> are incorrect
> It does not look like jira will let my upload images. I'll try and 

[jira] [Created] (SPARK-15829) spark master webpage links to application UI broke when running in cluster mode

2016-06-08 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-15829:
---

 Summary: spark master webpage links to application UI broke when 
running in cluster mode
 Key: SPARK-15829
 URL: https://issues.apache.org/jira/browse/SPARK-15829
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.6.1
 Environment: AWS ec2 cluster
Reporter: Andrew Davidson
Priority: Critical


Hi 
I created a cluster using the spark-1.6.1-bin-hadoop2.6/ec2/spark-ec2

I use the stand alone cluster manager. I have a streaming app running in 
cluster mode. I notice the master webpage links to the application UI page are 
incorrect

It does not look like jira will let me upload images. I'll try and describe the 
web pages and the bug

My master is running on
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:8080/

It has a section marked "applications". If I click on one of the running 
application ids I am taken to a page showing "Executor Summary".  This page has 
a link to the 'application detail UI'; the url is 
http://ec2-54-215-230-73.us-west-1.compute.amazonaws.com:4041/

Notice it thinks the application UI is running on the cluster master.

It is actually running on the same machine as the driver on port 4041. I was 
able to reverse engineer the url by noticing the private ip address is part of 
the worker id, for example worker-20160322041632-172.31.23.201-34909

next I went on the aws ec2 console to find the public DNS name for this machine 
http://ec2-54-193-104-169.us-west-1.compute.amazonaws.com:4041/streaming/

Kind regards

Andy




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database

2016-05-26 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15302359#comment-15302359
 ] 

Andrew Davidson commented on SPARK-15506:
-

Hi Jeff

Here is how I start the notebook server. I believe spark uses jupyter

$SPARK_ROOT/bin/pyspark


Can you tell me where I can find out more about configuration details? I do not 
think the issue is multiple users. I discovered the bug while running two 
notebooks on my local machine, i.e. I was running both notebooks myself. It seems 
like each notebook server needs its own database?

Kind regards

Andy

P.S. Even in our data center I start the notebook server the same way. I am the 
only data scientist.


> only one notebook can define a UDF; java.sql.SQLException: Another instance 
> of Derby may have already booted the database
> -
>
> Key: SPARK-15506
> URL: https://issues.apache.org/jira/browse/SPARK-15506
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: Mac OSX El Captain
> Python 3.4.2
>Reporter: Andrew Davidson
>
> I am using a sqlContext to create dataframes. I noticed that if I open up and 
> run 'notebook a' and 'a' defines a udf. That I will not be able to open a 
> second notebook that also defines a udf unless I shut down notebook a first.
> In the second notebook I get a big long stack trace.  The problem seems to be
> Caused by: java.sql.SQLException: Another instance of Derby may have already 
> booted the database 
> /Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
>  Source)
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   ... 86 more
> Here is the complete stack track
> Kind regards
> Andy
> You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
> assembly
> ---
> Py4JJavaError Traceback (most recent call last)
>  in ()
>  16 #fooUDF = udf(lambda arg : "aedwip")
>  17 
> ---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5))
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in udf(f, returnType)
>1595 [Row(slen=5), Row(slen=3)]
>1596 """
> -> 1597 return UserDefinedFunction(f, returnType)
>1598 
>1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in __init__(self, func, returnType, name)
>1556 self.returnType = returnType
>1557 self._broadcast = None
> -> 1558 self._judf = self._create_judf(name)
>1559 
>1560 def _create_judf(self, name):
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
>  in _create_judf(self, name)
>1567 pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command, self)
>1568 ctx = SQLContext.getOrCreate(sc)
> -> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
>1570 if name is None:
>1571 name = f.__name__ if hasattr(f, '__name__') else 
> f.__class__.__name__
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
>  in _ssql_ctx(self)
> 681 try:
> 682 if not hasattr(self, '_scala_HiveContext'):
> --> 683 self._scala_HiveContext = self._get_hive_ctx()
> 684 return self._scala_HiveContext
> 685 except Py4JError as e:
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
>  in _get_hive_ctx(self)
> 690 
> 691 def _get_hive_ctx(self):
> --> 692 return self._jvm.HiveContext(self._jsc.sc())
> 693 
> 694 def refreshTable(self, tableName):
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1062 answer = self._gateway_client.send_command(command)
>1063 return_value = get_return_value(
> -> 1064 answer, self._gateway_client, None, self._fqn)
>1065 
>1066 for temp_arg in temp_args:
> /Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py
>  

[jira] [Created] (SPARK-15506) only one notebook can define a UDF; java.sql.SQLException: Another instance of Derby may have already booted the database

2016-05-24 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-15506:
---

 Summary: only one notebook can define a UDF; 
java.sql.SQLException: Another instance of Derby may have already booted the 
database
 Key: SPARK-15506
 URL: https://issues.apache.org/jira/browse/SPARK-15506
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
 Environment: Mac OSX El Captain
Python 3.4.2

Reporter: Andrew Davidson


I am using a sqlContext to create dataframes. I noticed that if I open up and 
run 'notebook a' and 'a' defines a udf, then I will not be able to open a 
second notebook that also defines a udf unless I shut down notebook a first.

In the second notebook I get a big long stack trace.  The problem seems to be

Caused by: java.sql.SQLException: Another instance of Derby may have already 
booted the database 
/Users/andrewdavidson/workSpace/bigPWSWorkspace/dataScience/notebooks/gnip/metastore_db.
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown
 Source)
at 
org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
Source)
... 86 more



Here is the complete stack track

Kind regards

Andy

You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt 
assembly
---
Py4JJavaError Traceback (most recent call last)
 in ()
 16 #fooUDF = udf(lambda arg : "aedwip")
 17 
---> 18 paddedStrUDF = udf(lambda zipInt : str(zipInt).zfill(5))

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in udf(f, returnType)
   1595 [Row(slen=5), Row(slen=3)]
   1596 """
-> 1597 return UserDefinedFunction(f, returnType)
   1598 
   1599 blacklist = ['map', 'since', 'ignore_unicode_prefix']

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in __init__(self, func, returnType, name)
   1556 self.returnType = returnType
   1557 self._broadcast = None
-> 1558 self._judf = self._create_judf(name)
   1559 
   1560 def _create_judf(self, name):

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/functions.py
 in _create_judf(self, name)
   1567 pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command, self)
   1568 ctx = SQLContext.getOrCreate(sc)
-> 1569 jdt = ctx._ssql_ctx.parseDataType(self.returnType.json())
   1570 if name is None:
   1571 name = f.__name__ if hasattr(f, '__name__') else 
f.__class__.__name__

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
 in _ssql_ctx(self)
681 try:
682 if not hasattr(self, '_scala_HiveContext'):
--> 683 self._scala_HiveContext = self._get_hive_ctx()
684 return self._scala_HiveContext
685 except Py4JError as e:

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/context.py
 in _get_hive_ctx(self)
690 
691 def _get_hive_ctx(self):
--> 692 return self._jvm.HiveContext(self._jsc.sc())
693 
694 def refreshTable(self, tableName):

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
 in __call__(self, *args)
   1062 answer = self._gateway_client.send_command(command)
   1063 return_value = get_return_value(
-> 1064 answer, self._gateway_client, None, self._fqn)
   1065 
   1066 for temp_arg in temp_args:

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/pyspark/sql/utils.py
 in deco(*a, **kw)
 43 def deco(*a, **kw):
 44 try:
---> 45 return f(*a, **kw)
 46 except py4j.protocol.Py4JJavaError as e:
 47 s = e.java_exception.toString()

/Users/andrewdavidson/workSpace/spark/spark-1.6.1-bin-hadoop2.6/python/lib/py4j-0.9-src.zip/py4j/protocol.py
 in get_return_value(answer, gateway_client, target_id, name)
306 raise Py4JJavaError(
307 "An error occurred while calling {0}{1}{2}.\n".
--> 308 format(target_id, ".", name), value)
309 else:
310 raise Py4JError(

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient
at 

[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones

2016-04-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15228529#comment-15228529
 ] 

Andrew Davidson commented on SPARK-14057:
-

Hi Vijay

I should have some time to take a look tomorrow

Kind regards

Andy

From:  "Vijay Parmar (JIRA)" 
Date:  Tuesday, April 5, 2016 at 3:23 PM
To:  Andrew Davidson 
Subject:  [jira] [Commented] (SPARK-14057) sql time stamps do not respect
time zones





> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> we have time stamp data. The time stamp data is UTC how ever when we load the 
> data into spark data frames, the system assume the time stamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center into their local macs. The data centers run 
> UTC. There computers are typically in PST or EST.
> It is possible to hack around this problem
> This cause a lot of errors in their analysis
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones

2016-03-29 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15216293#comment-15216293
 ] 

Andrew Davidson commented on SPARK-14057:
-

Hi Vijay

I am fairly new to this also. I think I would contact Russell about his 
analysis of the spark cassandra connector.

Potentially making a change to the way dates are handled could break a lot of 
existing apps. I think the key is to figure out how to come up with a solution 
that is backwards compatible.

I have filed several bugs in the past, but only after discussion on the users 
email group. I would suggest that before you do a lot of work you see what the 
user group thinks

many thanks for taking this on

Andy

> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> we have time stamp data. The time stamp data is UTC how ever when we load the 
> data into spark data frames, the system assume the time stamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center into their local macs. The data centers run 
> UTC. There computers are typically in PST or EST.
> It is possible to hack around this problem
> This cause a lot of errors in their analysis
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14057) sql time stamps do not respect time zones

2016-03-28 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214334#comment-15214334
 ] 

Andrew Davidson commented on SPARK-14057:
-

Hi Vijay

Here is some more info from the email thread mentioned above. Russell is very 
involved in the development of Cassandra and the spark cassandra connector. He 
has a suggestion for how to fix this bug.

kind regards

Andy

http://www.slideshare.net/RussellSpitzer

On Fri, Mar 18, 2016 at 11:35 AM Russell Spitzer 
wrote:
> Unfortunately part of Spark SQL. They have based their type on
> java.sql.timestamp (and date) which adjust to the client timezone when
> displaying and storing.
> See discussions
> http://stackoverflow.com/questions/9202857/timezones-in-sql-date-vs-java-sql-date
> And Code
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> 

> sql time stamps do not respect time zones
> -
>
> Key: SPARK-14057
> URL: https://issues.apache.org/jira/browse/SPARK-14057
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> we have time stamp data. The time stamp data is UTC how ever when we load the 
> data into spark data frames, the system assume the time stamps are in the 
> local time zone. This causes problems for our data scientists. Often they 
> pull data from our data center into their local macs. The data centers run 
> UTC. There computers are typically in PST or EST.
> It is possible to hack around this problem
> This cause a lot of errors in their analysis
> A complete description of this issue can be found in the following mail msg
> https://www.mail-archive.com/user@spark.apache.org/msg48121.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14057) sql time stamps do not respect time zones

2016-03-21 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-14057:
---

 Summary: sql time stamps do not respect time zones
 Key: SPARK-14057
 URL: https://issues.apache.org/jira/browse/SPARK-14057
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Davidson
Priority: Minor


We have time stamp data. The time stamp data is UTC, however when we load the 
data into spark data frames the system assumes the time stamps are in the local 
time zone. This causes problems for our data scientists. Often they pull data 
from our data center into their local macs. The data centers run UTC. Their 
computers are typically in PST or EST.

It is possible to hack around this problem (one such hack is sketched below).

This causes a lot of errors in their analysis

A complete description of this issue can be found in the following mail msg

https://www.mail-archive.com/user@spark.apache.org/msg48121.html
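
One commonly used hack-around (an assumption on my part, not an official fix): pin 
every JVM involved to UTC so that java.sql.Timestamp values are never 
re-interpreted in the analyst's local zone. The Java sketch below handles the 
driver side in code; executors would need the equivalent -Duser.timezone=UTC JVM 
option, for example via spark.executor.extraJavaOptions in spark-defaults.conf.

```java
import java.util.TimeZone;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class UtcEverywhere {
    public static void main(String[] args) {
        // Pin the driver JVM to UTC before any DataFrame / timestamp work happens,
        // so parsed timestamps are not shifted into the laptop's PST/EST zone.
        TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("utcTimestamps"));
        SQLContext sqlContext = new SQLContext(sc);

        // The literal below is now interpreted and displayed as UTC on the driver.
        // Executors that parse timestamp strings still need -Duser.timezone=UTC
        // (e.g. spark.executor.extraJavaOptions) or they keep using the machine zone.
        sqlContext.sql("SELECT cast('2016-03-21 12:00:00' as timestamp) AS ts").show();

        sc.stop();
    }
}
```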



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-03 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15130923#comment-15130923
 ] 

Andrew Davidson commented on SPARK-13065:
-

The code looks really nice

well done

Andy

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it 
> how ever you like. In my streaming spark app main() I have the following code
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128832#comment-15128832
 ] 

Andrew Davidson edited comment on SPARK-13065 at 2/2/16 7:20 PM:
-

Hi Sachin

I attached my java implementation for this enhancement as a reference. I also 
changed the description above. I added the code I use in my streaming spark app 
main() 

I chose a bad name for the attachment. It's not in patch format.

Kind regards

Andy


was (Author: aedwip):
sorry bad name. its not in patch format

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it 
> how ever you like. In my streaming spark app main() I have the following code
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-13065:

Description: 
The twitter stream api is very powerful and provides a lot of support for 
twitter.com-side filtering of status objects. Whenever possible we want to let 
twitter do as much work as possible for us.

Currently the spark twitter api only allows you to configure a small subset of 
the possible filters:

String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);

The current implementation does 

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
should be able to pass a FilterQuery object.

Looks like an easy fix. See source code links below.

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89


2/2/16
Attached is my java implementation for this problem. Feel free to reuse it 
however you like. In my streaming spark app main() I have the following code:

   FilterQuery query = config.getFilterQuery().fetch();
if (query != null) {
// TODO https://issues.apache.org/jira/browse/SPARK-13065
tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
query);
} /*else 
spark native api
String[] filters = {"tag1", "tag2"};
tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);

see 
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89

causes
 val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
} */

  was:
The twitter stream api is very powerful provides a lot of support for 
twitter.com side filtering of status objects. When ever possible we want to let 
twitter do as much work as possible for us.

currently the spark twitter api only allows you to configure a small sub set of 
possible filters 

String{} filters = {"tag1", tag2"}
JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
filters);

The current implemenation does 

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
should be able to pass a FilterQueryObject

looks like an easy fix. See source code links bellow

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89


> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> 

[jira] [Updated] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-02 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-13065:

Attachment: twitterFilterQueryPatch.tar.gz

Sorry, bad name. It's not in patch format.

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
> Attachments: twitterFilterQueryPatch.tar.gz
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> $ 2/2/16
> attached is my java implementation for this problem. Feel free to reuse it 
> how ever you like. In my streaming spark app main() I have the following code
>FilterQuery query = config.getFilterQuery().fetch();
> if (query != null) {
> // TODO https://issues.apache.org/jira/browse/SPARK-13065
> tweets = TwitterFilterQueryUtils.createStream(ssc, twitterAuth, 
> query);
> } /*else 
> spark native api
> String[] filters = {"tag1", tag2"}
> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);
> 
> see 
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
> 
> causes
>  val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
> } */



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-02-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15128843#comment-15128843
 ] 

Andrew Davidson commented on SPARK-13009:
-

Hi Sean

I totally agree with you. The Twitter4j people asked me to file an RFE with spark. 
I agree it is their problem. I am just looking for some sort of workaround. My 
downstream systems will not be able to process the data I am capturing.

I guess in the short term I will create the wrapper object and modify the spark 
twitter source code.

kind regards

Andy

> spark-streaming-twitter_2.10 does not make it possible to access the raw 
> twitter json
> -
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Minor
>
> The Streaming-twitter package makes it easy for Java programmers to work with 
> twitter. The implementation returns the raw twitter data in JSON formate as a 
> twitter4J StatusJSONImpl object
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The status class is different then the raw JSON. I.E. serializing the status 
> object will be the same as the original json. I have down stream systems that 
> can only process raw tweets not twitter4J Status objects. 
> Here is my bug/RFE request made to Twitter4J . 
> They asked  I create a spark tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis how ever I need to 
> store the raw JSON. There are other systems that process that data that are 
> not written in Java.
> Currently we are serializing the Status Object. The JSON is going to break 
> down stream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be 
> possible to add a member function to StatusJSONImpl getRawJson(). By default 
> the returned value would be null unless jsonStoreEnabled=True  is set in the 
> config.
> Alternative implementations:
>  
> It should be possible to modify the spark-streaming-twitter_2.10 to provide 
> this support. The solutions is not very clean
> It would required apache spark to define their own Status Pojo. The current 
> StatusJSONImpl class is marked final
> The Wrapper is not going to work nicely with existing code.
> spark-streaming-twitter_2.10  does not expose all of the twitter streaming 
> API so many developers are writing their implementations of 
> org.apache.park.streaming.twitter.TwitterInputDStream. This make maintenance 
> difficult. Its not easy to know when the spark implementation for twitter has 
> changed. 
> Code listing for 
> spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
>   @volatile private var twitterStream: TwitterStream = _
>   @volatile private var stopped = false
>   def onStart() {
> try {
>   val newTwitterStream = new 
> TwitterStreamFactory().getInstance(twitterAuth)
>   newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
>   store(status)
> }
> Ref: 
> https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think?
> Kind regards
> Andy
> From:  on behalf of Igor Brigadir 
> 
> Reply-To: 
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J 
> Subject: Re: [Twitter4J] trouble writing unit test
> Main issue is that the Json object is in the wrong json format.
> eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
> + 2015", ...
> It looks like the json you have was serialized from a java Status object, 
> which makes json objects different to what you get from the API, 
> TwitterObjectFactory expects json from Twitter (I haven't had any problems 
> using TwitterObjectFactory instead of the Deprecated DataObjectFactory).
> You could "fix" it by matching the keys & values you have with the correct, 
> twitter API json - it should look like the example here: 
> https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, but this 

[jira] [Commented] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-02-01 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126936#comment-15126936
 ] 

Andrew Davidson commented on SPARK-13009:
-

Agreed. They asked me to file a spark RFE. I will post your comment back to them 
and see what they say. 

If the StatusJSONImpl class was not marked final it would be easy for spark to 
create the wrapper.

Andy

> spark-streaming-twitter_2.10 does not make it possible to access the raw 
> twitter json
> -
>
> Key: SPARK-13009
> URL: https://issues.apache.org/jira/browse/SPARK-13009
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Andrew Davidson
>Priority: Blocker
>  Labels: twitter
>
> The Streaming-twitter package makes it easy for Java programmers to work with 
> twitter. The implementation returns the raw twitter data in JSON formate as a 
> twitter4J StatusJSONImpl object
> JavaDStream tweets = TwitterUtils.createStream(ssc, twitterAuth);
> The status class is different then the raw JSON. I.E. serializing the status 
> object will be the same as the original json. I have down stream systems that 
> can only process raw tweets not twitter4J Status objects. 
> Here is my bug/RFE request made to Twitter4J . 
> They asked  I create a spark tracking issue.
> On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
> Hi All
> Quick problem summary:
> My system uses the Status objects to do some analysis how ever I need to 
> store the raw JSON. There are other systems that process that data that are 
> not written in Java.
> Currently we are serializing the Status Object. The JSON is going to break 
> down stream systems.
> I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources
> Request For Enhancement:
> I imagine easy access to the raw JSON is a common requirement. Would it be 
> possible to add a member function to StatusJSONImpl getRawJson(). By default 
> the returned value would be null unless jsonStoreEnabled=True  is set in the 
> config.
> Alternative implementations:
>  
> It should be possible to modify the spark-streaming-twitter_2.10 to provide 
> this support. The solutions is not very clean
> It would required apache spark to define their own Status Pojo. The current 
> StatusJSONImpl class is marked final
> The Wrapper is not going to work nicely with existing code.
> spark-streaming-twitter_2.10  does not expose all of the twitter streaming 
> API so many developers are writing their implementations of 
> org.apache.park.streaming.twitter.TwitterInputDStream. This make maintenance 
> difficult. Its not easy to know when the spark implementation for twitter has 
> changed. 
> Code listing for 
> spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
>   @volatile private var twitterStream: TwitterStream = _
>   @volatile private var stopped = false
>   def onStart() {
> try {
>   val newTwitterStream = new 
> TwitterStreamFactory().getInstance(twitterAuth)
>   newTwitterStream.addListener(new StatusListener {
> def onStatus(status: Status): Unit = {
>   store(status)
> }
> Ref: 
> https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html
> What do people think?
> Kind regards
> Andy
> From:  on behalf of Igor Brigadir 
> 
> Reply-To: 
> Date: Tuesday, January 19, 2016 at 5:55 AM
> To: Twitter4J 
> Subject: Re: [Twitter4J] trouble writing unit test
> Main issue is that the Json object is in the wrong json format.
> eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
> + 2015", ...
> It looks like the json you have was serialized from a java Status object, 
> which makes json objects different to what you get from the API, 
> TwitterObjectFactory expects json from Twitter (I haven't had any problems 
> using TwitterObjectFactory instead of the Deprecated DataObjectFactory).
> You could "fix" it by matching the keys & values you have with the correct, 
> twitter API json - it should look like the example here: 
> https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
> But it might be easier to download the tweets again, but this time use 
> TwitterObjectFactory.getRawJSON(status) to get the Original Json from the 
> Twitter API, and save that 

[jira] [Commented] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-02-01 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15126931#comment-15126931
 ] 

Andrew Davidson commented on SPARK-13065:
-

Thanks Sachin

I do not know Scala. As a temporary workaround I wound up rewriting the 
twitter stuff in Java so I could pass a twitter4j FilterQuery object. It would be 
much better to improve the spark api.

Andy

> streaming-twitter pass twitter4j.FilterQuery argument to 
> TwitterUtils.createStream()
> 
>
> Key: SPARK-13065
> URL: https://issues.apache.org/jira/browse/SPARK-13065
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: all
>Reporter: Andrew Davidson
>Priority: Minor
>  Labels: twitter
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> The twitter stream api is very powerful provides a lot of support for 
> twitter.com side filtering of status objects. When ever possible we want to 
> let twitter do as much work as possible for us.
> currently the spark twitter api only allows you to configure a small sub set 
> of possible filters 
> String{} filters = {"tag1", tag2"}
> JavaDStream tweets =TwitterUtils.createStream(ssc, twitterAuth, 
> filters);
> The current implemenation does 
> private[streaming]
> class TwitterReceiver(
> twitterAuth: Authorization,
> filters: Seq[String],
> storageLevel: StorageLevel
>   ) extends Receiver[Status](storageLevel) with Logging {
> . . .
>   val query = new FilterQuery
>   if (filters.size > 0) {
> query.track(filters.mkString(","))
> newTwitterStream.filter(query)
>   } else {
> newTwitterStream.sample()
>   }
> ...
> rather than construct the FilterQuery object in TwitterReceiver.onStart(). we 
> should be able to pass a FilterQueryObject
> looks like an easy fix. See source code links bellow
> kind regards
> Andy
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60
> https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13065) streaming-twitter pass twitter4j.FilterQuery argument to TwitterUtils.createStream()

2016-01-28 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-13065:
---

 Summary: streaming-twitter pass twitter4j.FilterQuery argument to 
TwitterUtils.createStream()
 Key: SPARK-13065
 URL: https://issues.apache.org/jira/browse/SPARK-13065
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
 Environment: all
Reporter: Andrew Davidson
Priority: Minor


The twitter stream api is very powerful and provides a lot of support for 
twitter.com-side filtering of status objects. Whenever possible we want to let 
twitter do as much work as possible for us.

Currently the spark twitter api only allows you to configure a small subset of 
the possible filters:

String[] filters = {"tag1", "tag2"};
JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth, filters);

The current implementation does 

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

. . .


  val query = new FilterQuery
  if (filters.size > 0) {
query.track(filters.mkString(","))
newTwitterStream.filter(query)
  } else {
newTwitterStream.sample()
  }

...

Rather than construct the FilterQuery object in TwitterReceiver.onStart(), we 
should be able to pass a FilterQuery object.

Looks like an easy fix. See source code links below.

kind regards

Andy

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L60

https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala#L89
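
For reference, a minimal Java sketch of the kind of twitter.com-side filter the 
current String[] argument cannot express. The FilterQuery builder calls are plain 
twitter4j; the createStream overload this issue asks for appears only in a 
comment because it does not exist yet:

```java
import twitter4j.FilterQuery;

public class FilterQueryExample {
    // Builds a twitter.com-side filter richer than keyword-only tracking.
    public static FilterQuery richFilter() {
        return new FilterQuery()
                .track("tag1", "tag2")                                        // keywords
                .follow(12345L, 67890L)                                       // example user ids
                .locations(new double[][]{{-122.75, 36.8}, {-121.75, 37.8}}); // bounding box
        // The enhancement asks for something along the lines of
        //   JavaDStream<Status> tweets =
        //       TwitterUtils.createStream(jssc, twitterAuth, richFilter());
        // which does not exist today; with the current API only track() keywords survive.
    }
}
```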



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13009) spark-streaming-twitter_2.10 does not make it possible to access the raw twitter json

2016-01-26 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-13009:
---

 Summary: spark-streaming-twitter_2.10 does not make it possible to 
access the raw twitter json
 Key: SPARK-13009
 URL: https://issues.apache.org/jira/browse/SPARK-13009
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.6.0
Reporter: Andrew Davidson
Priority: Blocker


The Streaming-twitter package makes it easy for Java programmers to work with 
twitter. The implementation returns the raw twitter data in JSON format as a 
twitter4J StatusJSONImpl object

JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);

The Status class is different from the raw JSON, i.e. serializing the Status 
object will not produce the same JSON as the original. I have downstream systems 
that can only process raw tweets, not twitter4J Status objects. 

Here is my bug/RFE request made to Twitter4J. They 
asked that I create a spark tracking issue.


On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
Hi All

Quick problem summary:

My system uses the Status objects to do some analysis, however I need to store 
the raw JSON. There are other systems that process that data that are not 
written in Java.
Currently we are serializing the Status Object. The JSON is going to break 
downstream systems.
I am using the Apache Spark Streaming spark-streaming-twitter_2.10  
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Request For Enhancement:
I imagine easy access to the raw JSON is a common requirement. Would it be 
possible to add a member function getRawJson() to StatusJSONImpl? By default 
the returned value would be null unless jsonStoreEnabled=True is set in the 
config.


Alternative implementations:
 

It should be possible to modify spark-streaming-twitter_2.10 to provide 
this support. The solution is not very clean:

It would require apache spark to define their own Status POJO. The current 
StatusJSONImpl class is marked final.
The wrapper is not going to work nicely with existing code.
spark-streaming-twitter_2.10 does not expose all of the twitter streaming API, 
so many developers are writing their own implementations of 
org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance 
difficult. It's not easy to know when the spark implementation for twitter has 
changed. 
Code listing for 
spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

private[streaming]
class TwitterReceiver(
twitterAuth: Authorization,
filters: Seq[String],
storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

  @volatile private var twitterStream: TwitterStream = _
  @volatile private var stopped = false

  def onStart() {
try {
  val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
  newTwitterStream.addListener(new StatusListener {
def onStatus(status: Status): Unit = {
  store(status)
}
Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html

What do people think?

Kind regards

Andy

From:  on behalf of Igor Brigadir 

Reply-To: 
Date: Tuesday, January 19, 2016 at 5:55 AM
To: Twitter4J 
Subject: Re: [Twitter4J] trouble writing unit test

Main issue is that the Json object is in the wrong json format.

eg: "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
+ 2015", ...

It looks like the json you have was serialized from a java Status object, which 
makes json objects different to what you get from the API, TwitterObjectFactory 
expects json from Twitter (I haven't had any problems using 
TwitterObjectFactory instead of the Deprecated DataObjectFactory).

You could "fix" it by matching the keys & values you have with the correct, 
twitter API json - it should look like the example here: 
https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid

But it might be easier to download the tweets again, but this time use 
TwitterObjectFactory.getRawJSON(status) to get the Original Json from the 
Twitter API, and save that for later. (You must have jsonStoreEnabled=True in 
your config, and call getRawJSON in the same thread as .showStatus() or 
lookup() or whatever you're using to load tweets.)
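A minimal Java sketch of that suggestion, as an assumption rather than a confirmed recipe: it targets Twitter4J 4.x (where TwitterObjectFactory replaces DataObjectFactory) and expects OAuth credentials to come from twitter4j.properties.

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterObjectFactory;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;
import twitter4j.conf.ConfigurationBuilder;

public class RawJsonCapture {
    public static void main(String[] args) {
        // jsonStoreEnabled must be true or getRawJSON() will return null
        ConfigurationBuilder cb = new ConfigurationBuilder().setJSONStoreEnabled(true);

        TwitterStream stream = new TwitterStreamFactory(cb.build()).getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                // must run on the same thread that delivered the status
                String rawJson = TwitterObjectFactory.getRawJSON(status);
                // keep the Status for the Java-side analysis and hand rawJson
                // to the non-Java downstream systems
                System.out.println(rawJson);
            }
        });
        stream.sample();
    }
}
```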







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?

2016-01-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12606:
---

 Summary: Scala/Java compatibility issue Re: how to extend java 
transformer from Scala UnaryTransformer ?
 Key: SPARK-12606
 URL: https://issues.apache.org/jira/browse/SPARK-12606
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.2
 Environment: Java 8, Mac OS, Spark-1.5.2
Reporter: Andrew Davidson



Hi Andy,

I suspect that you hit the Scala/Java compatibility issue, I can also reproduce 
this issue, so could you file a JIRA to track this issue?

Yanbo

2016-01-02 3:38 GMT+08:00 Andy Davidson :
I am trying to write a trivial transformer to use in my pipeline. I am using
Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as
an example. This should be very easy; however, I do not understand Scala and am
having trouble debugging the following exception.

Any help would be greatly appreciated.

Happy New Year

Andy

java.lang.IllegalArgumentException: requirement failed: Param null__inputCol 
does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
at org.apache.spark.ml.param.Params$class.set(params.scala:436)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at org.apache.spark.ml.param.Params$class.set(params.scala:422)
at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
at 
org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)



public class StemmerTest extends AbstractSparkTest {
@Test
public void test() {
Stemmer stemmer = new Stemmer()
.setInputCol("raw”) //line 30
.setOutputCol("filtered");
}
}

/**
 * @ see 
spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
 * @ see 
https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
 * @ see 
http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
 * 
 * @author andrewdavidson
 *
 */
public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
static Logger logger = LoggerFactory.getLogger(Stemmer.class);
private static final long serialVersionUID = 1L;
private static final  ArrayType inputType = 
DataTypes.createArrayType(DataTypes.StringType, true);
private final String uid = Stemmer.class.getSimpleName() + "_" + 
UUID.randomUUID().toString();

@Override
public String uid() {
return uid;
}

/*
   override protected def validateInputType(inputType: DataType): Unit = {
require(inputType == StringType, s"Input type must be string type but got 
$inputType.")
  }
 */
@Override
public void validateInputType(DataType inputTypeArg) {
String msg = "inputType must be " + inputType.simpleString() + " but 
got " + inputTypeArg.simpleString();
assert (inputType.equals(inputTypeArg)) : msg; 
}

@Override
public Function1<List<String>, List<String>> createTransformFunc() {
// http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
Function1<List<String>, List<String>> f =
        new AbstractFunction1<List<String>, List<String>>() {
    public List<String> apply(List<String> words) {
        for (String word : words) {
            logger.error("AEDWIP input word: {}", word);
        }
        return words;
    }
};

return f;
}

@Override
public DataType outputDataType() {
return DataTypes.createArrayType(DataTypes.StringType, true);
}
}
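One plausible explanation (an assumption, not confirmed in this thread): Java field initializers only run after the superclass constructor has finished, and UnaryTransformer registers inputCol/outputCol against uid() while it is still being constructed, so uid() returns null at that moment, which would produce exactly the "null__inputCol" parent seen in the exception. Under that assumption, a sketch of a workaround is to create the uid lazily inside uid() instead of in a field initializer, so the value handed out during super-construction matches later calls. This would replace the uid field and uid() method in the Stemmer above (the class already uses java.util.UUID):

```java
// Sketch only - assumes the "uid is null during super-construction" explanation above.
private String uid;  // no field initializer on purpose

@Override
public String uid() {
    if (uid == null) {
        // created on first access, which happens inside UnaryTransformer's constructor
        uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    }
    return uid;
}
```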




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12483:
---

 Summary: Data Frame as() does not work in Java
 Key: SPARK-12483
 URL: https://issues.apache.org/jira/browse/SPARK-12483
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: Mac El Cap 10.11.2
Java 8
Reporter: Andrew Davidson


The following unit test demonstrates a bug in as(): the column name for aliasDF
was not changed.

   @Test
public void bugDataFrameAsTest() {
DataFrame df = createData();
df.printSchema();
df.show();

 DataFrame aliasDF = df.select("id").as("UUID");
 aliasDF.printSchema();
 aliasDF.show();
}

DataFrame createData() {
Features f1 = new Features(1, category1);
Features f2 = new Features(2, category2);
ArrayList<Features> data = new ArrayList<Features>(2);
data.add(f1);
data.add(f2);
//JavaRDD<Features> rdd = javaSparkContext.parallelize(Arrays.asList(f1, f2));
JavaRDD<Features> rdd = javaSparkContext.parallelize(data);
DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
return df;
}

This is the output I got (without the spark log msgs)

root
 |-- id: integer (nullable = false)
 |-- labelStr: string (nullable = true)

+---+------------+
| id|    labelStr|
+---+------------+
|  1|       noise|
|  2|questionable|
+---+------------+

root
 |-- id: integer (nullable = false)

+---+
| id|
+---+
|  1|
|  2|
+---+
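A hedged reading of the 1.5 API, not a resolution from this ticket: DataFrame.as(alias) attaches an alias to the whole DataFrame (mainly for disambiguating self-joins) and leaves column names untouched, which matches the output above. Renaming the column itself goes through Column.as() inside select(), roughly as in this sketch (assuming df is the DataFrame built in createData()):

```java
// sketch against the Spark 1.5 Java API
DataFrame renamedDF = df.select(df.col("id").as("UUID"));
renamedDF.printSchema();  // the schema now shows "UUID" instead of "id"
renamedDF.show();
```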




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068696#comment-15068696
 ] 

Andrew Davidson commented on SPARK-12484:
-

What I am really trying to do is rewrite the following Python code in Java.
Ideally I would implement this code as an MLlib Transformer; however, that does
not seem possible at this point in time using the Java API.

Kind regards

Andy

def convertMultinomialLabelToBinary(dataFrame):
newColName = "binomialLabel"
binomial = udf(lambda labelStr: labelStr if (labelStr == "noise") else 
"signal", StringType())
ret = dataFrame.withColumn(newColName, binomial(dataFrame["label"]))
return ret
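A rough Java sketch of the same transformation, going through a registered UDF and withColumn() rather than an ml Transformer. This is an assumption-level example only: it presumes Spark 1.5 with Java 8, and that sqlContext and a string column named "label" come from the surrounding test code.

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;

public class BinomialLabel {
    public static DataFrame convertMultinomialLabelToBinary(SQLContext sqlContext, DataFrame dataFrame) {
        // same lambda the Python version passes to udf()
        sqlContext.udf().register("binomial",
                (UDF1<String, String>) labelStr -> "noise".equals(labelStr) ? "noise" : "signal",
                DataTypes.StringType);
        return dataFrame.withColumn("binomialLabel", callUDF("binomial", dataFrame.col("label")));
    }
}
```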

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

added a unit test file

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12484:

Attachment: UDFTest.java

Add a unit test file

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068698#comment-15068698
 ] 

Andrew Davidson commented on SPARK-12484:
-

related issue: https://issues.apache.org/jira/browse/SPARK-12483

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069027#comment-15069027
 ] 

Andrew Davidson commented on SPARK-12483:
-

Hi Xiao

thanks for looking at the issue

Is there a way to change a column name? If you do a select() using a data
frame, the resulting column name is really strange.

See the attachment for https://issues.apache.org/jira/browse/SPARK-12484:

// get column from data frame call df.withColumnName
Column newCol = udfDF.col("_c0");

Renaming data frame columns is very common in R.
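Two ways that should do the rename in the 1.5 Java API, as a rough sketch (assuming udfDF comes from the attached UDFTest):

```java
// rename the generated "_c0" column in place
DataFrame renamed = udfDF.withColumnRenamed("_c0", "transformedByUDF");

// or alias the column while selecting it
DataFrame selected = udfDF.select(udfDF.col("_c0").as("transformedByUDF"));
```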

Kind regards

Andy

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069061#comment-15069061
 ] 

Andrew Davidson commented on SPARK-12484:
-

Hi Xiao

thanks for looking into this

Andy

> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: SPARK_12483_unitTest.java

add a unit test file 

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: SPARK_12483_unitTest.java
>
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12484:
---

 Summary: DataFrame withColumn() does not work in Java
 Key: SPARK-12484
 URL: https://issues.apache.org/jira/browse/SPARK-12484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2
 Environment: mac El Cap. 10.11.2
Java 8
Reporter: Andrew Davidson


DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises

 org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
transformedByUDF#3];
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12484) DataFrame withColumn() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068684#comment-15068684
 ] 

Andrew Davidson commented on SPARK-12484:
-

You can find some more background on the email thread 'should I file a bug? 
Re: trouble implementing Transformer and calling DataFrame.withColumn()'.



> DataFrame withColumn() does not work in Java
> 
>
> Key: SPARK-12484
> URL: https://issues.apache.org/jira/browse/SPARK-12484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: mac El Cap. 10.11.2
> Java 8
>Reporter: Andrew Davidson
> Attachments: UDFTest.java
>
>
> DataFrame transformerdDF = df.withColumn(fieldName, newCol); raises
>  org.apache.spark.sql.AnalysisException: resolved attribute(s) _c0#2 missing 
> from id#0,labelStr#1 in operator !Project [id#0,labelStr#1,_c0#2 AS 
> transformedByUDF#3];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
> at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
> at org.apache.spark.sql.DataFrame.select(DataFrame.scala:691)
> at org.apache.spark.sql.DataFrame.withColumn(DataFrame.scala:1150)
> at com.pws.fantasySport.ml.UDFTest.test(UDFTest.java:75)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Comment: was deleted

(was: add a unit test file )

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12483) Data Frame as() does not work in Java

2015-12-22 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12483:

Attachment: (was: SPARK_12483_unitTest.java)

> Data Frame as() does not work in Java
> -
>
> Key: SPARK-12483
> URL: https://issues.apache.org/jira/browse/SPARK-12483
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
> Environment: Mac El Cap 10.11.2
> Java 8
>Reporter: Andrew Davidson
>
> Following unit test demonstrates a bug in as(). The column name for aliasDF 
> was not changed
>@Test
> public void bugDataFrameAsTest() {
> DataFrame df = createData();
> df.printSchema();
> df.show();
> 
>  DataFrame aliasDF = df.select("id").as("UUID");
>  aliasDF.printSchema();
>  aliasDF.show();
> }
> DataFrame createData() {
> Features f1 = new Features(1, category1);
> Features f2 = new Features(2, category2);
> ArrayList data = new ArrayList(2);
> data.add(f1);
> data.add(f2);
> //JavaRDD rdd = 
> javaSparkContext.parallelize(Arrays.asList(f1, f2));
> JavaRDD rdd = javaSparkContext.parallelize(data);
> DataFrame df = sqlContext.createDataFrame(rdd, Features.class);
> return df;
> }
> This is the output I got (without the spark log msgs)
> root
>  |-- id: integer (nullable = false)
>  |-- labelStr: string (nullable = true)
> +---+------------+
> | id|    labelStr|
> +---+------------+
> |  1|       noise|
> |  2|questionable|
> +---+------------+
> root
>  |-- id: integer (nullable = false)
> +---+
> | id|
> +---+
> |  1|
> |  2|
> +---+



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-03 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12110:

Attachment: launchCluster.sh.out
launchCluster.sh

launchCluster.sh is a wrapper around the spark-ec2 script.

launchCluster.sh.out is the output from when I ran this script on Nov 5th, 2015.

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
> Attachments: launchCluster.sh, launchCluster.sh.out
>
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:214)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>   at 

[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-03 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038574#comment-15038574
 ] 

Andrew Davidson commented on SPARK-12110:
-

Hi Davies

Attached is a script I wrote to launch the cluster and the output it produced
when I ran it on Nov 5th, 2015.

I also included a file, launchingSparkCluster.md, with directions for how to
configure the cluster to use Java 8, Python 3, ...

On a related note, I would like to update my cluster to 1.5.2. I have not been
able to find any directions for how to do this.

Kind regards

Andy

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
> Attachments: launchCluster.sh, launchCluster.sh.out, 
> launchingSparkCluster.md
>
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
> 

[jira] [Commented] (SPARK-12111) need upgrade instruction

2015-12-03 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038563#comment-15038563
 ] 

Andrew Davidson commented on SPARK-12111:
-

Hi Sean

I am unable to find instructions for upgrading existing installations. Can you
point me at the documentation for upgrading?

I built the cluster using the spark-ec2 script.

Kind Regards

Andy

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-03 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson updated SPARK-12110:

Attachment: launchingSparkCluster.md

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
> Attachments: launchCluster.sh, launchCluster.sh.out, 
> launchingSparkCluster.md
>
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>   at py4j.Gateway.invoke(Gateway.java:214)
>   at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
>   at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
>   at py4j.GatewayConnection.run(GatewayConnection.java:207)
>   at 

[jira] [Created] (SPARK-12100) bug in spark/python/pyspark/rdd.py portable_hash()

2015-12-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12100:
---

 Summary: bug in spark/python/pyspark/rdd.py portable_hash()
 Key: SPARK-12100
 URL: https://issues.apache.org/jira/browse/SPARK-12100
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.1
Reporter: Andrew Davidson
Priority: Minor


I am using spark-1.5.1-bin-hadoop2.6. I used
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured
spark-env to use python3. I get an exception 'Randomness of hash of string
should be disabled via PYTHONHASHSEED'. Is there any reason rdd.py should not
just set PYTHONHASHSEED?

Should I file a bug?

Kind regards

Andy

details

http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=subtract#pyspark.RDD.subtract

The example from the documentation does not work out of the box:

subtract(other, numPartitions=None)
Return each value in self that is not contained in other.

>>> x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
>>> y = sc.parallelize([("a", 3), ("c", None)])
>>> sorted(x.subtract(y).collect())
[('a', 1), ('b', 4), ('b', 5)]
It raises 

if sys.version >= '3.3' and 'PYTHONHASHSEED' not in os.environ:
    raise Exception("Randomness of hash of string should be disabled via PYTHONHASHSEED")


The following script fixes the problem:

sudo printf "\n# set PYTHONHASHSEED so python3 will not generate Exception 'Randomness of hash of string should be disabled via PYTHONHASHSEED'\nexport PYTHONHASHSEED=123\n" >> /root/spark/conf/spark-env.sh

sudo pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date "+%Y-%m-%d:%H:%M"`

for i in `cat slaves`; do sudo scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done


This is how I am starting spark

export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  

$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*


see email thread "possible bug spark/python/pyspark/rdd.py portable_hash()' on 
user@spark for more info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12110:
---

 Summary: spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  
Exception: ("You must build Spark with Hive 
 Key: SPARK-12110
 URL: https://issues.apache.org/jira/browse/SPARK-12110
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark, SQL
Affects Versions: 1.5.1
 Environment: cluster created using 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
Reporter: Andrew Davidson


I am using spark-1.5.1-bin-hadoop2.6. I used 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
spark-env to use python3. I can not run the tokenizer sample code. Is there a 
work around?

Kind regards

Andy

/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
658 raise Exception("You must build Spark with Hive. "
659 "Export 'SPARK_HIVE=true' and run "
--> 660 "build/sbt assembly", e)
661 
662 def _get_hive_ctx(self):

Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError('An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))




http://spark.apache.org/docs/latest/ml-features.html#tokenizer

from pyspark.ml.feature import Tokenizer, RegexTokenizer

sentenceDataFrame = sqlContext.createDataFrame([
  (0, "Hi I heard about Spark"),
  (1, "I wish Java could use case classes"),
  (2, "Logistic,regression,models,are,neat")
], ["label", "sentence"])
tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsDataFrame = tokenizer.transform(sentenceDataFrame)
for words_label in wordsDataFrame.select("words", "label").take(3):
  print(words_label)

---
Py4JJavaError Traceback (most recent call last)
/root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
654 if not hasattr(self, '_scala_HiveContext'):
--> 655 self._scala_HiveContext = self._get_hive_ctx()
656 return self._scala_HiveContext

/root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
662 def _get_hive_ctx(self):
--> 663 return self._jvm.HiveContext(self._jsc.sc())
664 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
700 return_value = get_return_value(answer, self._gateway_client, 
None,
--> 701 self._fqn)
702 

/root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
 35 try:
---> 36 return f(*a, **kw)
 37 except py4j.protocol.Py4JJavaError as e:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:

Py4JJavaError: An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.io.IOException: Filesystem closed
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at 
org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:171)
at 
org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
at 
org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:167)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at 
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:323)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1057)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:554)
at 
org.apache.hadoop.hive.ql.session.SessionState.createRootHDFSDir(SessionState.java:599)
at 

[jira] [Commented] (SPARK-12111) need upgrade instruction

2015-12-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036888#comment-15036888
 ] 

Andrew Davidson commented on SPARK-12111:
-

This is where someone who knows the details of how Spark gets built and
installed needs to provide some directions.

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12111) need upgrade instruction

2015-12-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15036887#comment-15036887
 ] 

Andrew Davidson commented on SPARK-12111:
-

Hi Sean

I understand I will need to stop my cluster to change to a different version. I
am looking for directions for how to "change to a different version". E.g. on my
local Mac I have several different versions of Spark downloaded, and I have an
env var SPARK_ROOT=pathToVersion pointing at the one I want to use.

To use something like pyspark I would run
$ $SPARK_ROOT/bin/pyspark

I am looking for directions for how to do something similar in a cluster env. I
think the rough steps would be:

1) stop the cluster
2) download the binary. Is the binary the same on all the machines (i.e.
masters and slaves)?
3) I am not sure what to do about the config/*

 

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12111) need upgrade instruction

2015-12-02 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson reopened SPARK-12111:
-

Hi Sean

It must be possible for customers to upgrade installations. Given that Spark is
written in Java, it is probably a matter of replacing jar files and maybe making
a few changes to config files.

Whoever is responsible for the build/release of Spark can probably write down the
instructions. It's not reasonable to say "destroy your old cluster and re-install
it." In my experience Spark does not work out of the box; you have to do a lot
of work to configure it properly. I have a lot of data on HDFS that I cannot
simply move.

sincerely yours

Andy

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12111) need upgrade instruction

2015-12-02 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-12111:
---

 Summary: need upgrade instruction
 Key: SPARK-12111
 URL: https://issues.apache.org/jira/browse/SPARK-12111
 Project: Spark
  Issue Type: Documentation
  Components: EC2
Affects Versions: 1.5.1
Reporter: Andrew Davidson


I have looked all over the Spark website and googled. I have not found
instructions for how to upgrade Spark in general, let alone a cluster created by
using the spark-ec2 script.

thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037017#comment-15037017
 ] 

Andrew Davidson commented on SPARK-12110:
-

Hi Patrick

Here is how I start my notebook on my cluster.

$ cat ../bin/startIPythonNotebook.sh 
export SPARK_ROOT=/root/spark
export 
MASTER_URL=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000 --log-level=WARN"  

extraPkgs='--packages com.databricks:spark-csv_2.11:1.3.0'
numCores=3 # one for driver 2 for workers
$SPARK_ROOT/bin/pyspark \
--master $MASTER_URL \
--total-executor-cores $numCores \
--driver-memory 2G \
--executor-memory 2G \
$extraPkgs \
$*

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at 

[jira] [Commented] (SPARK-12110) spark-1.5.1-bin-hadoop2.6; pyspark.ml.feature Exception: ("You must build Spark with Hive

2015-12-02 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037022#comment-15037022
 ] 

Andrew Davidson commented on SPARK-12110:
-

Hi Patrick

When I run the same example code on my local MacBook Pro, it runs fine.

I am a newbie. Is the spark-ec2 script deprecated? I noticed on my cluster:

[ec2-user@ip-172-31-29-60 notebooks]$ cat /root/spark/RELEASE
Spark 1.5.1 built for Hadoop 1.2.1
Build flags: -Psparkr -Phadoop-1 -Phive -Phive-thriftserver -DzincPort=3030
[ec2-user@ip-172-31-29-60 notebooks]$ 


on my local mac
$ cat ./spark-1.5.1-bin-hadoop2.6/RELEASE 
Spark 1.5.1 built for Hadoop 2.6.0
Build flags: -Psparkr -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn 
-DzincPort=3034
$

It looks like ./spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 may have installed 
the wrong version of Spark.
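
One way to see which build a running installation is actually using, without hunting 
for RELEASE files, is to ask the JVM through py4j. A small sketch; it leans on the 
non-public _jvm handle, so treat it as a debugging aid rather than a supported API:

```python
from pyspark import SparkContext

sc = SparkContext(appName="buildCheck")
print("Spark :", sc.version)
# org.apache.hadoop.util.VersionInfo is a standard Hadoop class; reaching it
# through the driver's private _jvm gateway is a debugging convenience only.
print("Hadoop:", sc._jvm.org.apache.hadoop.util.VersionInfo.getVersion())
```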

> spark-1.5.1-bin-hadoop2.6;  pyspark.ml.feature  Exception: ("You must build 
> Spark with Hive 
> 
>
> Key: SPARK-12110
> URL: https://issues.apache.org/jira/browse/SPARK-12110
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.5.1
> Environment: cluster created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>Reporter: Andrew Davidson
>
> I am using spark-1.5.1-bin-hadoop2.6. I used 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create a cluster and configured 
> spark-env to use python3. I can not run the tokenizer sample code. Is there a 
> work around?
> Kind regards
> Andy
> {code}
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 658 raise Exception("You must build Spark with Hive. "
> 659 "Export 'SPARK_HIVE=true' and run "
> --> 660 "build/sbt assembly", e)
> 661 
> 662 def _get_hive_ctx(self):
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError('An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o38))
> http://spark.apache.org/docs/latest/ml-features.html#tokenizer
> from pyspark.ml.feature import Tokenizer, RegexTokenizer
> sentenceDataFrame = sqlContext.createDataFrame([
>   (0, "Hi I heard about Spark"),
>   (1, "I wish Java could use case classes"),
>   (2, "Logistic,regression,models,are,neat")
> ], ["label", "sentence"])
> tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
> wordsDataFrame = tokenizer.transform(sentenceDataFrame)
> for words_label in wordsDataFrame.select("words", "label").take(3):
>   print(words_label)
> ---
> Py4JJavaError Traceback (most recent call last)
> /root/spark/python/pyspark/sql/context.py in _ssql_ctx(self)
> 654 if not hasattr(self, '_scala_HiveContext'):
> --> 655 self._scala_HiveContext = self._get_hive_ctx()
> 656 return self._scala_HiveContext
> /root/spark/python/pyspark/sql/context.py in _get_hive_ctx(self)
> 662 def _get_hive_ctx(self):
> --> 663 return self._jvm.HiveContext(self._jsc.sc())
> 664 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
> __call__(self, *args)
> 700 return_value = get_return_value(answer, self._gateway_client, 
> None,
> --> 701 self._fqn)
> 702 
> /root/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
>  35 try:
> ---> 36 return f(*a, **kw)
>  37 except py4j.protocol.Py4JJavaError as e:
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name)
> 299 'An error occurred while calling {0}{1}{2}.\n'.
> --> 300 format(target_id, '.', name), value)
> 301 else:
> Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.
> : java.lang.RuntimeException: java.io.IOException: Filesystem closed
>   at 
> org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.(ClientWrapper.scala:171)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:162)
>   at 
> org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:160)
>   at org.apache.spark.sql.hive.HiveContext.(HiveContext.scala:167)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> 

[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-06 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994616#comment-14994616
 ] 

Andrew Davidson commented on SPARK-11509:
-

Okay, after a couple of days of hacking it looks like my test program might be 
working. Here is my recipe (I hope this helps others).

My test program is now

In [1]:

from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
In [2]:

print("hello world”)

hello world
In [3]:

textFile.take(3)

Out[3]:
[' hello world', '']


Installation instructions

1. Ssh to the cluster master
2. Sudo su
3. Install python3.4 on all machines
```
yum install -y python34
bash-4.2# which python3
/usr/bin/python3

pssh -h /root/spark-ec2/slaves yum install -y python34

```

4. Install pip on all machines


```
yum list available |grep pip
yum install -y python34-pip

find /usr/bin -name "*pip*" -print
/usr/bin/pip-3.4

pssh -h /root/spark-ec2/slaves yum install -y python34-pip
```

5. Install IPython on the master (and push it to the slaves with pssh)

```
/usr/bin/pip-3.4 install ipython

pssh -h /root/spark-ec2/slaves /usr/bin/pip-3.4 install ipython
```

6. Install the Python development packages and Jupyter on the master
```
yum install -y python34-devel
/usr/bin/pip-3.4 install jupyter
```

7. Update spark-env.sh on all machines so that python3.4 is used by default

```
cd /root/spark/conf
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=python3.4\n" >> 
/root/spark/conf/spark-env.sh
for i in `cat slaves` ; do scp spark-env.sh 
root@$i:/root/spark/conf/spark-env.sh; done
```

8. Restart cluster

```
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
```

Running the IPython notebook

1. Set up an ssh tunnel on your local machine

ssh -i $KEY_FILE -N -f -L localhost::localhost:7000 ec2-user@$SPARK_MASTER

2. Log on to the cluster master and start the IPython notebook server

```
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000"  
$SPARK_ROOT/bin/pyspark --master local[2]
```

3. On your local machine open http://localhost:
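
After running a recipe like the one above, a quick way to confirm that the driver and 
every executor really picked up python3.4 is to have the workers report their own 
interpreter. A minimal sketch, assuming a working SparkContext named sc:

```python
import sys

# Driver-side interpreter.
print("driver :", sys.executable, sys.version_info[:3])

# Ask the executors which interpreter and version they launched. If
# spark-env.sh was propagated correctly this should collapse to a single
# python3.4 entry rather than a mix of versions.
worker_info = (sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
                 .map(lambda _: (sys.executable, tuple(sys.version_info[:3])))
                 .distinct()
                 .collect())
print("workers:", worker_info)
```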



> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded  spark-1.5.1-bin-hadoop2.6 to my local mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am 
> able to run the java SparkPi example on the cluster how ever I am not able to 
> run ipython notebooks on the cluster. (I connect using ssh tunnel)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook how ever
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following python versions miss match 
> error
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run ipython notebooks on my local Mac as follows. (by default 
> you would get an error that the driver and works are using different version 
> of python)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark 

[jira] [Reopened] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-06 Thread Andrew Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Davidson reopened SPARK-11509:
-

My issue is not resolved.

I am able to use IPython notebooks on my local Mac but still cannot run them 
on my cluster.

I have tried launching the same way I do on my Mac:

export PYSPARK_PYTHON=python2.7
export PYSPARK_DRIVER_PYTHON=python2.7
IPYTHON_OPTS="notebook --no-browser --port=7000"  $SPARK_ROOT/bin/pyspark 

I also tried setting 
export PYSPARK_PYTHON=python2.7

in /root/spark/conf/spark-env.sh on all my machines

The following code example

from pyspark import SparkContext
sc = SparkContext("local", "Simple App") # strange we should not have to create 
sc
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

generates the following error msg

Py4JJavaError Traceback (most recent call last)
 in ()
  2 sc = SparkContext("local", "Simple App")
  3 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> 4 textFile.take(3)

/root/spark/python/pyspark/rdd.py in take(self, num)
   1297 
   1298 p = range(partsScanned, min(partsScanned + numPartsToTry, 
totalParts))
-> 1299 res = self.context.runJob(self, takeUpToNumLeft, p)
   1300 
   1301 items += res

/root/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, 
partitions, allowLocal)
914 # SparkContext#runJob.
915 mappedRDD = rdd.mapPartitions(partitionFunc)
--> 916 port = self._jvm.PythonRDD.runJob(self._jsc.sc(), 
mappedRDD._jrdd, partitions)
917 return list(_load_from_socket(port, 
mappedRDD._jrdd_deserializer))
918 

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
536 answer = self.gateway_client.send_command(command)
537 return_value = get_return_value(answer, self.gateway_client,
--> 538 self.target_id, self.name)
539 
540 for temp_arg in temp_args:

/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): org.apache.spark.api.python.PythonException: Traceback (most 
recent call last):
  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, 
PySpark cannot run with different minor versions

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 

[jira] [Created] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)
Andrew Davidson created SPARK-11509:
---

 Summary: ipython notebooks do not work on clusters created using 
spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
 Key: SPARK-11509
 URL: https://issues.apache.org/jira/browse/SPARK-11509
 Project: Spark
  Issue Type: Bug
  Components: Documentation, EC2, PySpark
Affects Versions: 1.5.1
 Environment: AWS cluster
[ec2-user@ip-172-31-29-60 ~]$ uname -a
Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Reporter: Andrew Davidson


I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.

I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am 
able to run the Java SparkPi example on the cluster, however I am not able to 
run IPython notebooks on the cluster. (I connect using an ssh tunnel.)

According to the 1.5.1 getting started doc 
http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell

The following should work

 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
--no-browser --port=7000" /root/spark/bin/pyspark

I am able to connect to the notebook server and start a notebook, however:

bug 1) the default sparkContext does not exist

from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

---
NameError Traceback (most recent call last)
 in ()
  1 from pyspark import SparkContext
> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
  3 textFile.take(3)

NameError: name 'sc' is not defined

bug 2)
 If I create a SparkContext I get the following Python version mismatch error

sc = SparkContext("local", "Simple App")
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
textFile.take(3)

 File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 2.6, 
PySpark cannot run with different minor versions


I am able to run IPython notebooks on my local Mac as follows. (By default you 
would get an error that the driver and workers are using different versions of 
Python.)

$ cat ~/bin/pySparkNotebook.sh
#!/bin/sh 

set -x # turn debugging on
#set +x # turn debugging off

export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*


I have spent a lot of time trying to debug the pyspark script, however I cannot 
figure out what the problem is.

Please let me know if there is something I can do to help

Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990746#comment-14990746
 ] 

Andrew Davidson commented on SPARK-11509:
-

I forgot to mention: on my cluster master I was able to run

bin/pyspark --master local[2]

without any problems. I was able to access sc without any issues.



> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded  spark-1.5.1-bin-hadoop2.6 to my local mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am 
> able to run the java SparkPi example on the cluster how ever I am not able to 
> run ipython notebooks on the cluster. (I connect using ssh tunnel)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook how ever
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following python versions miss match 
> error
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run ipython notebooks on my local Mac as follows. (by default 
> you would get an error that the driver and works are using different version 
> of python)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark script however I can 
> not figure out what the problem is
> Please let me know if there is something I can do to help
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11509) ipython notebooks do not work on clusters created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script

2015-11-04 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14990742#comment-14990742
 ] 

Andrew Davidson commented on SPARK-11509:
-

Yes, it appears the show-stopper issue I am facing is that the Python versions 
do not match. However, on my local Mac I was able to figure out how to get 
everything to match; that technique does not work on a Spark cluster. I tried a 
lot of hacking but cannot seem to get the versions to match. I plan to 
install Python 3 on all machines; maybe that will work better.
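
One more thing I can try while experimenting: as far as I can tell, PySpark reads 
PYSPARK_PYTHON from the driver's environment when the context is created and hands that 
command to the executors, so the interpreter can also be pinned from driver-side code. 
A sketch, under the assumption that the named interpreter exists at the same path on 
every node:

```python
import os
from pyspark import SparkContext

# Assumption: /usr/bin/python2.7 exists at this exact path on the master and on
# every slave; otherwise the executors will fail when launching Python workers.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

sc = SparkContext(appName="versionPinningSketch")
# pythonExec is an internal attribute; printed here only to confirm what the
# executors will be asked to run.
print(sc.pythonExec)
```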

kind regards

Andy

> ipython notebooks do not work on clusters created using 
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> --
>
> Key: SPARK-11509
> URL: https://issues.apache.org/jira/browse/SPARK-11509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, EC2, PySpark
>Affects Versions: 1.5.1
> Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 
> SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Andrew Davidson
>
> I recently downloaded  spark-1.5.1-bin-hadoop2.6 to my local mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an aws cluster. I am 
> able to run the java SparkPi example on the cluster how ever I am not able to 
> run ipython notebooks on the cluster. (I connect using ssh tunnel)
> According to the 1.5.1 getting started doc 
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> The following should work
>  PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook 
> --no-browser --port=7000" /root/spark/bin/pyspark
> I am able to connect to the notebook server and start a notebook how ever
> bug 1) the default sparkContext does not exist
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3
> ---
> NameError Traceback (most recent call last)
>  in ()
>   1 from pyspark import SparkContext
> > 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>   3 textFile.take(3)
> NameError: name 'sc' is not defined
> bug 2)
>  If I create a SparkContext I get the following python versions miss match 
> error
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>  File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
> ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 
> 2.6, PySpark cannot run with different minor versions
> I am able to run ipython notebooks on my local Mac as follows. (by default 
> you would get an error that the driver and works are using different version 
> of python)
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh 
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*$ 
> I have spent a lot of time trying to debug the pyspark script however I can 
> not figure out what the problem is
> Please let me know if there is something I can do to help
> Andy



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-10-12 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168787#comment-14168787
 ] 

Andrew Davidson commented on SPARK-922:
---

Wow, upgrading matplotlib was a bear. The following worked for me; the trick was 
getting the correct version of the source code. The recipe below is not 100% 
correct: I have not figured out how to use pssh with yum, because yum prompts y/n 
before downloading.

pip2.7 install six
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install six

pip2.7 install python-dateutil
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install python-dateutil

pip2.7 install pyparsing
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install pyparsing


yum install yum-utils

wget https://github.com/matplotlib/matplotlib/archive/master.tar.gz
tar -zxvf master.tar.gz
cd matplotlib-master/
yum install freetype-devel
yum install libpng-devel
python2.7 setup.py build
python2.7 setup.py install

 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
Reporter: Josh Rosen

 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-10-12 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168788#comment-14168788
 ] 

Andrew Davidson commented on SPARK-922:
---

I also forgot to mention that there are a couple of steps on 
http://nbviewer.ipython.org/gist/JoshRosen/6856670
that are important in the upgrade process:

#
# restart spark
#
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh


 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.1.0
Reporter: Josh Rosen

 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7

2014-09-29 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150695#comment-14150695
 ] 

Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM:


Here is how I am launching the IPython notebook. I am running as the ec2-user:
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark

Below are all the upgrade commands I ran:

I ran into a small problem: the IPython magic %matplotlib inline creates an error; 
you can work around this by commenting it out.

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh




was (Author: aedwip):

Here is how I am launching the IPython notebook. I am running as the ec2-user:
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark

Below are all the upgrade commands I ran.

Any idea what I missed?

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh



 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Josh Rosen
 Fix For: 1.2.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-09-27 Thread Andrew Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150695#comment-14150695
 ] 

Andrew Davidson commented on SPARK-922:
---

I must have missed something. I am running the IPython notebook over an ssh tunnel, 
and I am still running the old version. I made sure to

export PYSPARK_PYTHON=python2.7

I also tried

export PYSPARK_PYTHON=/usr/bin/python2.7

import IPython
print IPython.sys_info()
{'commit_hash': '858d539',
 'commit_source': 'installation',
 'default_encoding': 'UTF-8',
 'ipython_path': 
'/usr/lib/python2.6/site-packages/ipython-0.13.2-py2.6.egg/IPython',
 'ipython_version': '0.13.2',
 'os_name': 'posix',
 'platform': 'Linux-3.4.37-40.44.amzn1.x86_64-x86_64-with-glibc2.2.5',
 'sys_executable': '/usr/bin/python2.6',
 'sys_platform': 'linux2',
 'sys_version': '2.6.9 (unknown, Sep 13 2014, 00:25:11) \n[GCC 4.8.2 20140120 
(Red Hat 4.8.2-16)]'}


Here is how I am launching the IPython notebook. I am running as the ec2-user:
IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark

Below are all the upgrade commands I ran.

Any idea what I missed?

Andy

yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy
pip2.7 install ipython[all]
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh
source /root/spark/conf/spark-env.sh
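
A quick check from inside a running notebook of whether the upgraded interpreter and 
IPython were actually picked up (a minimal sketch; nothing here is Spark-specific):

```python
import os
import sys
import IPython

# If the upgrade took effect, the executable should be the 2.7 install rather
# than /usr/bin/python2.6, and IPython should be the freshly pip-installed one.
print(sys.executable)
print(sys.version)
print(IPython.__version__)
print(os.environ.get("PYSPARK_PYTHON"))  # what pyspark will hand to the workers
```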



 Update Spark AMI to Python 2.7
 --

 Key: SPARK-922
 URL: https://issues.apache.org/jira/browse/SPARK-922
 Project: Spark
  Issue Type: Task
  Components: EC2, PySpark
Affects Versions: 0.9.0, 0.9.1, 1.0.0
Reporter: Josh Rosen
 Fix For: 1.2.0


 Many Python libraries only support Python 2.7+, so we should make Python 2.7 
 the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org