Re: Can Pyspark access Scala API?

2016-05-18 Thread Abi
Thanks for that. But the question is more general. Can PySpark access Scala
somehow?

On May 18, 2016 3:53:50 PM EDT, Ted Yu <yuzhih...@gmail.com> wrote:
>Not sure if you have seen this (for 2.0):
>
>[SPARK-15087][CORE][SQL] Remove AccumulatorV2.localValue and keep only
>value
>
>Can you tell us your use case?
>
>On Tue, May 17, 2016 at 9:16 PM, Abi <analyst.tech.j...@gmail.com>
>wrote:
>
>> Can Pyspark access Scala API? The accumulator in PySpark does not
>> have localValue available. The Scala API does have it available.
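
For reference, PySpark can reach Scala/Java classes on the driver through the
Py4J gateway (sc._jvm). These are underscore-prefixed internal attributes, so
the following is a hedged sketch rather than a supported public API, and it
only works on the driver, not inside executor tasks:

from pyspark import SparkContext

sc = SparkContext(appName="jvm-access-sketch")

# sc._jvm is the Py4J gateway into the driver-side JVM; any class on the
# driver's classpath can be reached through it.
jvm = sc._jvm
big = jvm.java.math.BigInteger("2")
print(big.pow(10).toString())        # -> 1024

# Spark's own Scala objects are reachable too, e.g. the Scala SparkContext
# behind the Python one (driver side only).
scala_sc = sc._jsc.sc()
print(scala_sc.defaultParallelism())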


Can Pyspark access Scala API?

2016-05-17 Thread Abi
Can Pyspark access Scala API? The accumulator in PySpark does not have
localValue available. The Scala API does have it available.

Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Abi
Please include me too

On May 12, 2016 6:08:14 AM EDT, Mich Talebzadeh  
wrote:
>Hi All,
>
>
>Following the threads on the Spark forum, I decided to write up notes on
>the configuration of Spark, covering allocation of resources,
>configuration of the driver, executors and threads, execution of Spark
>applications, and general troubleshooting, including the OS tools at our
>disposal.
>
>Since the most widespread configuration I see is "Spark Standalone
>Mode", I have decided to start these notes with Standalone and move on
>to YARN later.
>
>
>   - *Standalone* – a simple cluster manager included with Spark that
>     makes it easy to set up a cluster.
>   - *YARN* – the resource manager in Hadoop 2.
>
>
>I would appreciate it if anyone interested in reading and commenting
>would get in touch with me directly at mich.talebza...@gmail.com so I
>can send them the write-up for review and comments.
>
>
>Just to be clear, this is not meant to be a commercial proposition or
>anything like that. As I often get involved with members' troubleshooting
>issues and threads on this topic, I thought it worthwhile to write a note
>summarising the findings for the benefit of the community.
>
>
>Regards.
>
>
>Dr Mich Talebzadeh
>
>
>
>LinkedIn:
>https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
>http://talebzadehmich.wordpress.com
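
For the resource-allocation side mentioned above, a minimal sketch of how
driver/executor settings are commonly expressed; the values and the standalone
master URL (spark://master:7077) are placeholders, not recommendations:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("resource-allocation-sketch")
        .setMaster("spark://master:7077")        # or "yarn" on a YARN cluster
        .set("spark.executor.memory", "4g")      # memory per executor
        .set("spark.executor.cores", "2")        # cores per executor
        .set("spark.cores.max", "8"))            # standalone: total cores per app

# spark.driver.memory usually has to be passed to spark-submit (or set in
# spark-defaults.conf) instead, since the driver JVM is already running by
# the time this code executes.
sc = SparkContext(conf=conf)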


Re: Pyspark accumulator

2016-05-13 Thread Abi
On Tue, May 10, 2016 at 2:24 PM, Abi <analyst.tech.j...@gmail.com> wrote:

> 1. How come PySpark does not provide the localValue function like Scala?
>
> 2. Why is PySpark more restrictive than Scala?
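
For context, a minimal sketch of the PySpark accumulator as it stands: tasks
can only add to it, and the merged value is only readable back on the driver
via value (reading it inside a task raises an exception), which is presumably
why a separate localValue is not exposed:

from pyspark import SparkContext

sc = SparkContext(appName="accumulator-sketch")
acc = sc.accumulator(0)

def add_to_acc(x):
    # Tasks may only add; acc.value is not accessible inside a task.
    acc.add(x)

sc.parallelize(range(10), 2).foreach(add_to_acc)

# The merged value is only visible on the driver.
print(acc.value)   # 45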


Re: pyspark mappartions ()

2016-05-13 Thread Abi
On Tue, May 10, 2016 at 2:20 PM, Abi <analyst.tech.j...@gmail.com> wrote:

> Is there any example of this? I want to see how you write the
> iterable example.
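
A minimal sketch of mapPartitions() with the iterator argument; the data and
the per-partition logic are invented for illustration. The function receives
an iterator over one partition's elements and must return (or yield) an
iterable:

from pyspark import SparkContext

sc = SparkContext(appName="mappartitions-sketch")

def per_partition(iterator):
    # 'iterator' yields the elements of a single partition.
    # Do any per-partition setup here (e.g. open a connection), then
    # yield one output element at a time -- the result must be iterable.
    running_total = 0
    for x in iterator:
        running_total += x
        yield x * 2                              # transformed element
    yield ("partition_sum", running_total)       # one extra record per partition

rdd = sc.parallelize(range(1, 9), 2)
print(rdd.mapPartitions(per_partition).collect())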


broadcast variable not picked up

2016-05-13 Thread abi
def kernel(arg):
input = broadcast_var.value + 1
#some processing with input

def foo():
  
  
  broadcast_var = sc.broadcast(var)
  rdd.foreach(kernel)


def main():
   #something


In this code, I get the following error:
NameError: global name 'broadcast_var ' is not defined


Any ideas on how to fix it?
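
The error comes from broadcast_var being a local variable of foo(), so
kernel() cannot see it as a global when it runs. Either create the broadcast
at module level before calling the action, or pass the handle in explicitly.
A hedged sketch of the second option, reusing the names from the snippet
above (sc, rdd and var are assumed to exist):

def kernel(broadcast_var, arg):
    # The broadcast handle is an explicit argument now, so nothing relies
    # on a global name being visible inside the shipped closure.
    input_value = broadcast_var.value + 1
    # ... some processing with input_value ...

def foo(sc, rdd, var):
    broadcast_var = sc.broadcast(var)
    # Bind the handle into the function sent to the executors.
    rdd.foreach(lambda arg: kernel(broadcast_var, arg))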







pandas dataframe broadcasted. giving errors in datanode function called kernel

2016-05-13 Thread abi
The pandas dataframe is broadcast successfully, but it gives errors in the
executor-side ("datanode") function called kernel.

Code:

dataframe_broadcast = sc.broadcast(dataframe)

def kernel():
    df_v = dataframe_broadcast.value


Error:

I get this error when I try to access the value member of the broadcast
variable. Apparently it does not have a value, hence it tries to load it from
the file again.

  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py", line 97, in value
    self._value = self.load(self._path)
  File "C:\spark-1.6.1-bin-hadoop2.6\python\lib\pyspark.zip\pyspark\broadcast.py", line 88, in load
    return pickle.load(f)
ImportError: No module named indexes.base

        at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
        at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
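
"ImportError: No module named indexes.base" while unpickling a DataFrame
usually points to a pandas version mismatch between the driver and the Python
workers (pickles from newer pandas reference internal modules such as
pandas.indexes that older versions do not have). Aligning the pandas version
on all nodes is the direct fix; a hedged workaround sketch, reusing sc and
dataframe from the snippet above, is to broadcast plain Python data and
rebuild the DataFrame on the worker:

import pandas as pd

# Broadcast version-independent plain data instead of the DataFrame object
# itself, then rebuild the DataFrame on the worker.
records_broadcast = sc.broadcast(dataframe.to_dict("list"))

def kernel(_):
    # Reconstructed with whatever pandas version the worker has installed.
    df_v = pd.DataFrame(records_broadcast.value)
    return len(df_v)

print(sc.parallelize(range(4), 2).map(kernel).collect())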






Re: Pyspark accumulator

2016-05-10 Thread Abi


On May 10, 2016 2:24:41 PM EDT, Abi <analyst.tech.j...@gmail.com> wrote:
>1. How come PySpark does not provide the localValue function like
>Scala?
>
>2. Why is PySpark more restrictive than Scala?


Re: Accumulator question

2016-05-10 Thread Abi


On May 9, 2016 8:24:06 PM EDT, Abi <analyst.tech.j...@gmail.com> wrote:
>I am splitting an integer array into 2 partitions and using an
>accumulator to sum the array. The problem is:
>
>1. I am not seeing execution time become half of that of a linear sum.
>
>2. The second node (from looking at timestamps) takes 3 times as long
>as the first node. This gives the impression it is "waiting" for the
>first node to finish.
>
>Hence, I get the impression that using accumulator.sum() in the
>kernel and rdd.foreach(kernel) is making things sequential.
>
>Any API/setting suggestions for how I could make things parallel?


Re: pyspark mappartions ()

2016-05-10 Thread Abi


On May 10, 2016 2:20:25 PM EDT, Abi <analyst.tech.j...@gmail.com> wrote:
>Is there any example of this? I want to see how you write the
>iterable example.


Hi test

2016-05-10 Thread Abi
Hello test

Pyspark accumulator

2016-05-10 Thread Abi
1. How come PySpark does not provide the localValue function like Scala?

2. Why is PySpark more restrictive than Scala?

pyspark mappartions ()

2016-05-10 Thread Abi
Is there any example of this? I want to see how you write the iterable
example.

Re: Accumulator question

2016-05-09 Thread Abi
I am splitting an integer array into 2 partitions and using an accumulator to
sum the array. The problem is:

1. I am not seeing execution time become half of that of a linear sum.

2. The second node (from looking at timestamps) takes 3 times as long as
the first node. This gives the impression it is "waiting" for the first
node to finish.

Hence, I get the impression that using accumulator.sum() in the kernel and
rdd.foreach(kernel) is making things sequential.

Any API/setting suggestions for how I could make things parallel?



Accumulator question

2016-05-09 Thread Abi
I am splitting an integer array into 2 partitions and using an accumulator to
sum the array. The problem is:

1. I am not seeing execution time become half of that of a linear sum.

2. The second node (from looking at timestamps) takes 3 times as long as the
first node. This gives the impression it is "waiting" for the first node to
finish.

Hence, I get the impression that using accumulator.sum() in the kernel and
rdd.foreach(kernel) is making things sequential.

Any API/setting suggestions for how I could make things parallel?
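
For comparison, a minimal sketch (local SparkContext and synthetic data
assumed): foreach() with an accumulator launches one task per partition and
those tasks can run concurrently; the accumulator only merges per-task
results on the driver, so it should not by itself serialise execution. Large
timestamp skew is more often uneven partition sizes or scheduling. If the
goal is just the total, the built-in reductions avoid the accumulator
altogether:

from pyspark import SparkContext

sc = SparkContext(appName="sum-sketch")
data = list(range(1, 1001))
rdd = sc.parallelize(data, 2)   # 2 partitions -> 2 tasks that can run in parallel

# Accumulator version: each task adds locally; results are merged on the driver.
acc = sc.accumulator(0)
rdd.foreach(lambda x: acc.add(x))
print("accumulator total:", acc.value)

# Built-in reductions do the same aggregation without an accumulator.
print("rdd.sum():", rdd.sum())
print("reduce:   ", rdd.reduce(lambda a, b: a + b))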