RE: pyspark in intellij

2017-02-25 Thread Sidney Feiner
Yes, I got it working once but I can't exactly remember how.
I think what I did was the following:

· To the environment variables, add a variable named PYTHONPATH with the path to your pyspark python directory (in my case, C:\spark-2.1.0-bin-hadoop2.7\python)

· To the environment variables, append the same path to the PATH variable
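
A quick way to verify the setup from inside IntelliJ (a minimal sketch, assuming Spark is unpacked at C:\spark-2.1.0-bin-hadoop2.7; the py4j zip name under python\lib varies by Spark version):

import glob
import os
import sys

spark_home = r"C:\spark-2.1.0-bin-hadoop2.7"  # adjust to your installation
sys.path.insert(0, os.path.join(spark_home, "python"))
# pyspark also needs the bundled py4j zip on the path
sys.path.insert(0, glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*.zip"))[0])

from pyspark import SparkContext

sc = SparkContext("local[*]", "intellij-smoke-test")
print(sc.parallelize(range(10)).sum())
sc.stop()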

Hope these work ☺


Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp


From: Stephen Boesch [mailto:java...@gmail.com]
Sent: Sunday, February 26, 2017 3:56 AM
To: user 
Subject: pyspark in intellij

Anyone have this working - either in 1.X or 2.X?

thanks


In Spark streaming, will saved kafka offsets become invalid if I change the number of partitions in a kafka topic?

2017-02-25 Thread shyla deshpande
I am committing offsets to Kafka after my output has been stored, using the
commitAsync API.

My question is: if I increase or decrease the number of Kafka partitions, will
the saved offsets become invalid?

Thanks


Spark test error in ProactiveClosureSerializationSuite.scala

2017-02-25 Thread ??????????
Hello all, I am building Spark 1.6.2 and I ran into a problem when running mvn test.

The command is:

mvn -e -Pyarn -Phive -Phive-thriftserver -DwildcardSuites=org.apache.spark.serializer.ProactiveClosureSerializationSuite test

and the test errors are:
ProactiveClosureSerializationSuite:
- throws expected serialization exceptions on actions
- mapPartitions transformations throw proactive serialization exceptions *** FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown. (ProactiveClosureSerializationSuite.scala:58)
- map transformations throw proactive serialization exceptions
- filter transformations throw proactive serialization exceptions
- flatMap transformations throw proactive serialization exceptions
- mapPartitionsWithIndex transformations throw proactive serialization exceptions *** FAILED ***
  Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown. (ProactiveClosureSerializationSuite.scala:58)



I think this test is about "task not serializable", but why do I only get test
errors on mapPartitions and mapPartitionsWithIndex?
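
For reference, the suite asserts that an RDD whose closure captures something unserializable fails fast with a SparkException. A rough pyspark analogue of the same idea (illustrative only; the suite itself is Scala, and pyspark surfaces the problem as a pickling error instead):

import threading

from pyspark import SparkContext

sc = SparkContext("local[*]", "closure-serialization-demo")
lock = threading.Lock()  # thread locks cannot be pickled

try:
    # the closure captures `lock`, so serializing the task should fail
    sc.parallelize(range(4)).mapPartitions(lambda it: (x for x in it if lock)).count()
except Exception as e:
    print("task serialization failed as expected:", type(e).__name__)
finally:
    sc.stop()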


Thanks.

Spark SQL table authority control?

2017-02-25 Thread 李斌松
When connecting to the Spark Thrift Server over JDBC and executing Hive SQL,
Hive on Spark lets you check table read/write permissions by extending a hook,
so permissions can be controlled there. For Spark on Hive, what is the
corresponding extension point?


pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X?

thanks


Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Marco Mistroni
Try using --packages to include the jars. From the error, it seems spark-submit is
looking for a main class in the jars, but you are running a Python script...
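
With Spark 1.6 that would look roughly like this (a sketch; the artifact version must match your Spark and Scala build, and --packages pulls in the Kafka client transitively):

spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2 /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test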


Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you very much, Marco.

I am a beginner in this area. Is it possible for you to show me what you
think the right script should be, so I can get it executed in the terminal?


Sincerely yours,

Raymond



Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
That's right Anahita; however, the class name is not indicated in the
original GitHub project, so I don't know what class should be used here. The
GitHub page only says:
and then run the example
`$ bin/spark-submit --jars \
external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar
\
examples/src/main/python/streaming/kafka_wordcount.py \
localhost:2181 test`
Can anyone offer any thoughts on how to find out? Thank you very much in
advance.


Sincerely yours,

Raymond



Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Anahita Talebi
You're welcome.
You need to specify the class. I meant like this:

spark-submit /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar --class "give the name of the class"





Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
Thank you, it is still not working:

[image: Inline image 1]

By the way, here is the original source:

https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/kafka_wordcount.py


Sincerely yours,

Raymond



Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Anahita Talebi
Hi,

I think if you remove --jars, it will work. Like:

spark-submit /usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar

 I had the same problem before and solved it by removing --jars.

Cheers,
Anahita



Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread yohann jardin
You should read (again?) the Spark documentation about submitting an
application: http://spark.apache.org/docs/latest/submitting-applications.html

Try it with the Pi computation example that ships with Spark. For example:

./bin/spark-submit --class org.apache.spark.examples.SparkPi examples/jars/spark-examples*.jar

After --class you specify the fully qualified name of the main class you want
to run; you finish by specifying the jar that contains that main class.
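
Since kafka_wordcount.py is a Python script, no --class is needed at all: pass the Kafka assembly via --jars and the script itself as the application. A sketch, assuming a spark-streaming-kafka assembly jar is available on the sandbox (the wildcard path is illustrative):

spark-submit --jars /usr/hdp/2.5.0.0-1245/spark/lib/spark-streaming-kafka-assembly-*.jar /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test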

Yohann Jardin



No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Raymond Xie
I am doing Spark streaming on a Hortonworks sandbox and am stuck here now.
Can anyone tell me what's wrong with the following commands, what causes the
exceptions, and how I can fix them? Thank you very much in advance.

spark-submit --jars
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
No main class set in JAR; please specify one with --class


spark-submit --class
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
java.lang.ClassNotFoundException:
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar

spark-submit --class
 /usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar
/usr/hdp/2.5.0.0-1245/spark/lib/spark-assembly-1.6.2.2.5.0.0-1245-hadoop2.7.3.2.5.0.0-1245.jar
 /root/hdp/kafka_wordcount.py 192.168.128.119:2181 test

Error:
java.lang.ClassNotFoundException:
/usr/hdp/2.5.0.0-1245/kafka/libs/kafka-streams-0.10.0.2.5.0.0-1245.jar

Sincerely yours,

Raymond


PySpark + virtualenv: Using a different python path on the driver and on the executors

2017-02-25 Thread Tomer Benyamini
Hello,

I'm trying to run pyspark using the following setup:

- spark 1.6.1 standalone cluster on ec2
- virtualenv installed on master

- app is run using the following command:

export PYSPARK_DRIVER_PYTHON=/path_to_virtualenv/bin/python
export PYSPARK_PYTHON=/usr/bin/python
/root/spark/bin/spark-submit --py-files mypackage.tar.gz myapp.py

I'm getting the following error:

java.io.IOException: Cannot run program
"/path_to_virtualenv/bin/python": error=2, No such file or directory
   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)

--> Looks like the executor process did not honor the PYSPARK_PYTHON
setting, but used the same python executable it had on the driver (the
virtualenv python), rather than "/usr/bin/python".

What am I doing wrong here?

Thanks,
Tomer


Spark runs out of memory with small file

2017-02-25 Thread Henry Tremblay
I am reading in a single small file from Hadoop with wholeTextFiles. If I
process each line and create a row with two cells, the first cell equal
to the name of the file and the second cell equal to the line, that code
runs fine.


But if I just add two lines of code and change the first cell based on
parsing the line, Spark runs out of memory. Any idea why such a simple
process, which would succeed quickly in a non-Spark application, fails?


Thanks!

Henry

CODE:

[hadoop@ip-172-31-35-67 ~]$ hadoop fs -du /mnt/temp
3816096 
/mnt/temp/CC-MAIN-20170116095123-00570-ip-10-171-10-70.ec2.internal.warc.gz



In [1]: rdd1 = sc.wholeTextFiles("/mnt/temp")
In [2]: rdd1.count()
Out[2]: 1


In [4]: def process_file(s):
   ...:     text = s[1]
   ...:     the_id = s[0]
   ...:     d = {}
   ...:     l = text.split("\n")
   ...:     final = []
   ...:     for line in l:
   ...:         d[the_id] = line
   ...:         final.append(Row(**d))
   ...:     return final
   ...:

In [5]: rdd2 = rdd1.map(process_file)

In [6]: rdd2.count()
Out[6]: 1

In [7]: rdd3 = rdd2.flatMap(lambda x: x)

In [8]: rdd3.count()
Out[8]: 508310

In [9]: rdd3.take(1)
Out[9]: 
[Row(hdfs://ip-172-31-35-67.us-west-2.compute.internal:8020/mnt/temp/CC-MAIN-20170116095123-00570-ip-10-171-10-70.ec2.internal.warc.gz='WARC/1.0\r')]


In [10]: def process_file(s):
    ...:     text = s[1]
    ...:     d = {}
    ...:     l = text.split("\n")
    ...:     final = []
    ...:     the_id = "init"
    ...:     for line in l:
    ...:         if line[0:15] == 'WARC-Record-ID:':
    ...:             the_id = line[15:]
    ...:         d[the_id] = line
    ...:         final.append(Row(**d))
    ...:     return final

In [12]: rdd2 = rdd1.map(process_file)

In [13]: rdd2.count()
17/02/25 19:03:03 ERROR YarnScheduler: Lost executor 5 on 
ip-172-31-41-89.us-west-2.compute.internal: Container killed by YARN for 
exceeding memory limits. 10.3 GB of 10.3 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead.
17/02/25 19:03:03 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container killed by YARN for exceeding memory limits. 10.3 GB of 10.3 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/25 19:03:03 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 
5, ip-172-31-41-89.us-west-2.compute.internal, executor 5): 
ExecutorLostFailure (executor 5 exited caused by one of the running 
tasks) Reason: Container killed by YARN for exceeding memory limits. 
10.3 GB of 10.3 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
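
One thing worth checking (a guess, not a confirmed diagnosis): in the second version, d is never reset, so every new WARC-Record-ID adds a key, and each subsequent Row(**d) carries all keys seen so far. Row width grows with the number of records, so memory use grows roughly quadratically. A minimal sketch that keeps one field per row:

from pyspark.sql import Row

def process_file(s):
    text = s[1]
    the_id = "init"
    final = []
    for line in text.split("\n"):
        if line[0:15] == 'WARC-Record-ID:':
            the_id = line[15:]
        # build a fresh single-field dict per line instead of mutating a growing one
        final.append(Row(**{the_id: line}))
    return final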



--
Henry Tremblay
Robert Half Technology





Re: Get S3 Parquet File

2017-02-25 Thread Steve Loughran

On 24 Feb 2017, at 07:47, Femi Anthony wrote:

Have you tried reading using s3n, which is a slightly older protocol? I'm not
sure how compatible s3a is with older versions of Spark.

I would absolutely not use s3n with a 1.3 GB file.

There is a WONTFIX JIRA on how it will read to the end of a file when you close
a stream, and as seek() closes a stream, every seek will read to the end of the
file. And as readFully(position, bytes) does a seek at either end, every time the
Parquet code tries to read a bit of data you get 1.3 GB of download:
https://issues.apache.org/jira/browse/HADOOP-12376

That is not going to be fixed, ever, because it can only be done by upgrading
the libraries, and that will simply move new bugs in, lead to different
bug reports, etc., etc. All for a piece of code which has been supplanted in the
hadoop-2.7.x JARs by s3a, ready for use, and which in the forthcoming hadoop-2.8+
code is significantly faster for IO (especially ORC/Parquet), multi-GB uploads,
and even the basic metadata operations used when setting up queries.

For Hadoop 2.7+, use s3a. Any issues with s3n will be closed as "use s3a".




Femi

On Fri, Feb 24, 2017 at 2:18 AM, Benjamin Kim wrote:
Hi Gourav,

My answers are below.

Cheers,
Ben


On Feb 23, 2017, at 10:57 PM, Gourav Sengupta wrote:

Can I ask where you are running your CDH? Is it on premise, or have you created
a cluster for yourself in AWS? Our cluster is on premise in our data center.


You need to set up your s3a credentials in core-site, spark-defaults, or rely
on spark-submit picking up the submitter's AWS env vars and propagating them.
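
A minimal pyspark sketch of the per-job variant (assumptions: credentials come from the standard AWS environment variables, and the bucket path is hypothetical; fs.s3a.access.key / fs.s3a.secret.key are the hadoop-aws property names):

import os

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="s3a-read-test")
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
hconf.set("fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])

sqlContext = SQLContext(sc)
df = sqlContext.read.parquet("s3a://your-bucket/path/file.parquet")  # hypothetical path
print(df.count())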


Also, I have really never seen s3a used before; that was used way back when
writing s3 files took a long time, but I think that you are reading it.

Any ideas why you are not migrating to Spark 2.1? Besides speed, there are lots
of new APIs, and the existing ones are being deprecated. Therefore there is a
very high chance that you are already working on code which is being deprecated
by the Spark community right now. We use CDH and upgrade with whatever Spark
version they include, which is 1.6.0. We are waiting for the move to Spark
2.0/2.1.

This is in the Hadoop codebase, not the Spark release; it will be the same
irrespective of the Spark version.


And besides that, would you not want to work on a platform which is at least 10
times faster? What would that be?

Regards,
Gourav Sengupta

On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim wrote:
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet 
file from AWS S3. We can read the schema and show some data when the file is 
loaded into a DataFrame, but when we try to do some operations, such as count, 
we get this error below.

com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS 
credentials from any provider in the chain
at 
com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)
at 
com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:385)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
at 
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:162)
at 
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:145)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.<init>(SqlNewHadoopRDD.scala:180)
at 
org.apache.spark.rdd.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:126)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at 

Re: RDD blocks on Spark Driver

2017-02-25 Thread liangyhg...@gmail.com
Hi,

I think you are using the local mode of Spark. There are mainly four modes:
local, standalone, YARN and Mesos. Also, "blocks" is an HDFS notion, while
"partitions" is a Spark notion.

liangyihuai

---Original---
From: "Jacek Laskowski"
Date: 2017/2/25 02:45:20
To: "prithish"
Cc: "user"
Subject: Re: RDD blocks on Spark Driver

Hi,

Guess you're using local mode, which has only one executor called driver. Is
my guessing correct?

Jacek

On 23 Feb 2017 2:03 a.m., wrote:

Hello, I had a question. When I look at the executors tab in the Spark UI, I
notice that some RDD blocks are assigned to the driver as well. Can someone
please tell me why? Thanks for the help.


instrumenting Spark hit ratios

2017-02-25 Thread Mich Talebzadeh
One of the ways of ingesting data into HDFS is to use a Spark JDBC connection
to connect to a source and ingest data into the underlying files or Hive
tables.

One question that has come up is: under controlled test conditions, what would
the measurements of IO, CPU etc. be across the cluster?

Assuming one is not using UNIX tools such as Nagios etc., are there tools that
can be deployed for the Spark cluster itself? I guess top/htop can be used, but
those are available anyway.
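
For what it's worth, Spark ships its own Codahale-based metrics system, configured through conf/metrics.properties; a minimal sketch routing every instance's metrics to CSV files (the directory is illustrative):

*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=10
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics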

Thanks

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.