Re: Scala API: simplifying common patterns

2016-02-08 Thread Reynold Xin
Can you create a pull request? It is difficult to know what's going on.


On Mon, Feb 8, 2016 at 4:51 PM, sim  wrote:

> 24 test failures for sql/test:
> https://gist.github.com/ssimeonov/89862967f87c5c497322
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Scala-API-simplifying-common-patterns-tp16238p16247.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Jonathan Kelly
Alex,

That's a very good question that I've been trying to answer myself recently
too. Since you've mentioned before that you're using EMR, I assume you're
asking this because you've noticed this behavior on emr-4.3.0.

In this release, we made some changes to the maximizeResourceAllocation
setting (which you may or may not be using, but either way this issue is
present), including accidentally introducing a bug that prevents it from
reserving any space for the AM, which ultimately results in one of the
nodes being utilized only by the AM and not by an executor.

However, as you point out, the only viable fix seems to be to reserve
enough memory for the AM on *every single node*, which in some cases might
actually be worse than wasting a lot of memory on a single node.

So yeah, I also don't like either option. Is this just the price you pay
for running on YARN?


~ Jonathan
On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov 
wrote:

> Lets say that yarn has 53GB memory available on each slave
>
> spark.am container needs 896MB.  (512 + 384)
>
> I see two options to configure spark:
>
> 1. configure spark executors to use 52GB and leave 1 GB on each box. So,
> some box will also run am container. So, 1GB memory will not be used on all
> slaves but one.
>
> 2. configure spark to use all 53GB and add additional 53GB box which will
> run only am container. So, 52GB on this additional box will do nothing
>
> I do not like both options. Is there a better way to configure yarn/spark?
>
>
> Alex
>


spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Alexander Pivovarov
Let's say that YARN has 53GB of memory available on each slave.

The Spark AM container needs 896MB (512 + 384).

I see two options to configure Spark:

1. Configure Spark executors to use 52GB and leave 1GB on each box. Some box
will also run the AM container, so 1GB of memory will go unused on every
slave but one.

2. Configure Spark to use all 53GB and add an additional 53GB box which will
run only the AM container, so 52GB on that additional box will do nothing.

I don't like either option. Is there a better way to configure YARN/Spark?


Alex
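For reference, the 896MB figure matches YARN's usual AM sizing rule: the requested AM memory plus an overhead of max(384MB, 10% of the AM memory), rounded up to the scheduler's minimum allocation. A quick sketch of that arithmetic (the 10% factor, 384MB floor, and 512MB default are the typical defaults and may differ per deployment):

```python
import math

def am_container_mb(am_memory_mb=512, overhead_factor=0.10, overhead_min_mb=384,
                    yarn_min_alloc_mb=1):
    """Estimate the YARN container size for the Spark AM.

    Overhead is max(overhead_min_mb, overhead_factor * am_memory_mb); YARN then
    rounds the total up to a multiple of its minimum allocation.
    """
    overhead = max(overhead_min_mb, int(overhead_factor * am_memory_mb))
    total = am_memory_mb + overhead
    return int(math.ceil(total / float(yarn_min_alloc_mb)) * yarn_min_alloc_mb)

print(am_container_mb())  # 512 + 384 = 896
```

Note that if YARN's minimum allocation is larger (say 1024MB), the 896MB request is rounded up to a full 1024MB container, making the "wasted" slice even bigger.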


Re: spark on yarn wastes one box (or 1 GB on each box) for am container

2016-02-08 Thread Sean Owen
Typically YARN is there because you're mediating resource requests
from things besides Spark, so yeah using every bit of the cluster is a
little bit of a corner case. There's not a good answer if all your
nodes are the same size.

I think you can let YARN over-commit RAM though, and allocate more
memory than it actually has. It may be beneficial to let them all
think they have an extra GB, and let one node running the AM
technically be overcommitted, a state which won't hurt at all unless
you're really really tight on memory, in which case something might
get killed.
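A sketch of that suggestion (the property name is the standard YARN one; the exact value to advertise depends on your hardware and appetite for overcommit):

```xml
<!-- yarn-site.xml: advertise 1GB more than is physically available, so the
     node that happens to host the ~896MB AM container can still fit a
     full-size executor. Only the AM node ends up actually overcommitted. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>55296</value> <!-- 54GB advertised on a node with 53GB free -->
</property>
```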

On Tue, Feb 9, 2016 at 6:49 AM, Jonathan Kelly  wrote:
> Alex,
>
> That's a very good question that I've been trying to answer myself recently
> too. Since you've mentioned before that you're using EMR, I assume you're
> asking this because you've noticed this behavior on emr-4.3.0.
>
> In this release, we made some changes to the maximizeResourceAllocation
> (which you may or may not be using, but either way this issue is present),
> including the accidental inclusion of somewhat of a bug that makes it not
> reserve any space for the AM, which ultimately results in one of the nodes
> being utilized only by the AM and not an executor.
>
> However, as you point out, the only viable fix seems to be to reserve enough
> memory for the AM on *every single node*, which in some cases might actually
> be worse than wasting a lot of memory on a single node.
>
> So yeah, I also don't like either option. Is this just the price you pay for
> running on YARN?
>
>
> ~ Jonathan
>
> On Mon, Feb 8, 2016 at 9:03 PM Alexander Pivovarov 
> wrote:
>>
>> Lets say that yarn has 53GB memory available on each slave
>>
>> spark.am container needs 896MB.  (512 + 384)
>>
>> I see two options to configure spark:
>>
>> 1. configure spark executors to use 52GB and leave 1 GB on each box. So,
>> some box will also run am container. So, 1GB memory will not be used on all
>> slaves but one.
>>
>> 2. configure spark to use all 53GB and add additional 53GB box which will
>> run only am container. So, 52GB on this additional box will do nothing
>>
>> I do not like both options. Is there a better way to configure yarn/spark?
>>
>>
>> Alex

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Long running Spark job on YARN throws "No AMRMToken"

2016-02-08 Thread Prabhu Joseph
+ Spark-Dev

On Tue, Feb 9, 2016 at 10:04 AM, Prabhu Joseph 
wrote:

> Hi All,
>
> A long running Spark job on YARN throws below exception after running
> for few days.
>
> yarn.ApplicationMaster: Reporter thread fails 1 time(s) in a row.
> org.apache.hadoop.yarn.exceptions.YarnException: No AMRMToken found for
> user prabhu
> at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45)
>
> Do any of the below renew the AMRMToken and solve the issue?
>
> 1. yarn-resourcemanager.delegation.token.max-lifetime increase from 7 days
>
> 2. Configuring a proxy user (core-site.xml):
>
>   <property>
>     <name>hadoop.proxyuser.yarn.hosts</name>
>     <value>*</value>
>   </property>
>   <property>
>     <name>hadoop.proxyuser.yarn.groups</name>
>     <value>*</value>
>   </property>
>
> 3. Can Spark 1.4.0 handle this with the fix from
> https://issues.apache.org/jira/browse/SPARK-5342
> (spark.yarn.credentials.file)?
>
>
> How to renew the AMRMToken for a long running job on YARN?
>
>
> Thanks,
> Prabhu Joseph
>
>
>
>
>
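For what it's worth, the SPARK-5342 mechanism (keytab-based AM re-login, available on YARN from Spark 1.4) is the usual way around token expiry for long-running jobs; spark.yarn.credentials.file is then managed by Spark itself rather than set by hand. The principal and keytab path below are placeholders:

```
# spark-defaults.conf (hypothetical principal and path)
spark.yarn.principal   prabhu@EXAMPLE.COM
spark.yarn.keytab      /etc/security/keytabs/prabhu.keytab
```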


Re: Preserving partitioning with dataframe select

2016-02-08 Thread Matt Cheah
Interesting; I might be misinterpreting my Spark UI then, in terms of the
number of stages I'm seeing in the job before and after the pre-partitioning.

That said, I was mostly thinking about this when reading through the code.
In particular, under basicOperators.scala in org.apache.spark.sql.execution,
the Project gets compiled down to child.execute().mapPartitionsInternal
without passing the preservesPartitioning flag. Is this Projection being
moved around in the case that the optimizer wants to take advantage of
co-partitioning? Guidance on how to trace the planner's logic would be
appreciated!

-Matt Cheah

From:  Reynold Xin 
Date:  Sunday, February 7, 2016 at 11:11 PM
To:  Matt Cheah 
Cc:  "dev@spark.apache.org" , Mingyu Kim

Subject:  Re: Preserving partitioning with dataframe select

Matt, 

Thanks for the email. Are you just asking whether it should work, or
reporting they don't work?

Internally, the way we track physical data distribution should make the
scenarios described work. If it doesn't, we should make them work.


On Sat, Feb 6, 2016 at 6:49 AM, Matt Cheah  wrote:
> Hi everyone, 
> 
> When using raw RDDs, it is possible to have a map() operation indicate that
> the partitioning for the RDD would be preserved by the map operation. This
> makes it easier to reduce the overhead of shuffles by ensuring that RDDs are
> co-partitioned when they are joined.
> 
> When I'm using Data Frames, I'm pre-partitioning the data frame by using
> DataFrame.partitionBy($"X"), but I will invoke a select statement after the
> partitioning before joining that dataframe with another. Roughly speaking, I'm
> doing something like this pseudo-code:
> 
> partitionedDataFrame = dataFrame.partitionBy($"X")
> groupedDataFrame = partitionedDataFrame.groupBy($"X").agg(aggregations)
> // Rename "X" to "Y" to make sure columns are unique
> groupedDataFrameRenamed = groupedDataFrame.withColumnRenamed("X", "Y")
> // Roughly speaking, join on "X == Y" to get the aggregation results onto
> // every row
> joinedDataFrame = partitionedDataFrame.join(groupedDataFrameRenamed, $"X" === $"Y")
> 
> However, the renaming of the columns maps to a select statement, and to my
> knowledge, selecting the columns is throwing off the partitioning, which
> results in shuffling both the partitionedDataFrame and the groupedDataFrame.
> 
> I have the following questions given this example:
> 
> 1) Is pre-partitioning the Data Frame effective? In other words, does the
> physical planner recognize when underlying RDDs are co-partitioned and compute
> more efficient joins by reducing the amount of data that is shuffled?
> 2) If the planner takes advantage of co-partitioning, is the renaming of the
> columns invalidating the partitioning of the grouped Data Frame? When I look
> at the planner's conversion from logical.Project to the physical plan, I only
> see it invoking child.mapPartitions without specifying the
> preservesPartitioning flag.
> 
> Thanks,
> 
> -Matt Cheah







Welcoming two new committers

2016-02-08 Thread Matei Zaharia
Hi all,

The PMC has recently added two new Spark committers -- Herman van Hovell and 
Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten, adding 
new features, optimizations and APIs. Please join me in welcoming Herman and 
Wenchen.

Matei



Re: Welcoming two new committers

2016-02-08 Thread Ted Yu
Congratulations, Herman and Wenchen.

On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell
> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
> adding new features, optimizations and APIs. Please join me in welcoming
> Herman and Wenchen.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


[build system] brief downtime, 8am PST thursday feb 10th

2016-02-08 Thread shane knapp
happy monday!

i will be bringing down jenkins and the workers thursday morning to
upgrade docker on all of the workers from 1.5.0-1 to 1.7.1-2.

as of december last year, docker 1.5 and older lost the ability to
pull from the docker hub.  since we're running centos 6.X on our
workers, and can't run the 3.X kernel, that limits our options to
docker 1.7.

this will allow us to close out https://github.com/apache/spark/pull/9893

i'll be sure to send updates as they happen.

shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Welcoming two new committers

2016-02-08 Thread Bhupendra Mishra
Congratulations to both. and welcome to group.

On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell
> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
> adding new features, optimizations and APIs. Please join me in welcoming
> Herman and Wenchen.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Welcoming two new committers

2016-02-08 Thread Corey Nolet
Congrats guys!

On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:

> Congratulations, Herman and Wenchen.
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Welcoming two new committers

2016-02-08 Thread Luciano Resende
On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
wrote:

> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell
> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
> adding new features, optimizations and APIs. Please join me in welcoming
> Herman and Wenchen.
>
> Matei
>

Congratulations !!!

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: Welcoming two new committers

2016-02-08 Thread Dilip Biswal
Congratulations Wenchen and Herman !! 

Regards,
Dilip Biswal
Tel: 408-463-4980
dbis...@us.ibm.com



From:   Xiao Li 
To: Corey Nolet 
Cc: Ted Yu , Matei Zaharia 
, dev 
Date:   02/08/2016 09:39 AM
Subject:Re: Welcoming two new committers



Congratulations! Herman and Wenchen!  I am just so happy for you! You 
absolutely deserve it!

2016-02-08 9:35 GMT-08:00 Corey Nolet :
Congrats guys! 

On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:
Congratulations, Herman and Wenchen.

On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia  
wrote:
Hi all,

The PMC has recently added two new Spark committers -- Herman van Hovell 
and Wenchen Fan. Both have been heavily involved in Spark SQL and 
Tungsten, adding new features, optimizations and APIs. Please join me in 
welcoming Herman and Wenchen.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org








Re: Welcoming two new committers

2016-02-08 Thread Denny Lee
Awesome - congratulations Herman and Wenchan!

On Mon, Feb 8, 2016 at 10:26 AM Dilip Biswal  wrote:

> Congratulations Wenchen and Herman !!
>
> Regards,
> Dilip Biswal
> Tel: 408-463-4980
> dbis...@us.ibm.com
>
>
>
> From:Xiao Li 
> To:Corey Nolet 
> Cc:Ted Yu , Matei Zaharia <
> matei.zaha...@gmail.com>, dev 
> Date:02/08/2016 09:39 AM
> Subject:Re: Welcoming two new committers
> --
>
>
>
> Congratulations! Herman and Wenchen!  I am just so happy for you! You
> absolutely deserve it!
>
> 2016-02-08 9:35 GMT-08:00 Corey Nolet <*cjno...@gmail.com*
> >:
> Congrats guys!
>
> On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu <*yuzhih...@gmail.com*
> > wrote:
> Congratulations, Herman and Wenchen.
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia <*matei.zaha...@gmail.com*
> > wrote:
> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van Hovell
> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
> adding new features, optimizations and APIs. Please join me in welcoming
> Herman and Wenchen.
>
> Matei
> -
> To unsubscribe, e-mail: *dev-unsubscr...@spark.apache.org*
> 
> For additional commands, e-mail: *dev-h...@spark.apache.org*
> 
>
>
>
>
>
>


Re: Welcoming two new committers

2016-02-08 Thread Shixiong(Ryan) Zhu
Congrats!!! Herman and Wenchen!!!

On Mon, Feb 8, 2016 at 10:44 AM, Luciano Resende 
wrote:

>
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>>
>
> Congratulations !!!
>
> --
> Luciano Resende
> http://people.apache.org/~lresende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Welcoming two new committers

2016-02-08 Thread Xiao Li
Congratulations! Herman and Wenchen!  I am just so happy for you! You
absolutely deserve it!

2016-02-08 9:35 GMT-08:00 Corey Nolet :

> Congrats guys!
>
> On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu  wrote:
>
>> Congratulations, Herman and Wenchen.
>>
>> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC has recently added two new Spark committers -- Herman van Hovell
>>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>>> adding new features, optimizations and APIs. Please join me in welcoming
>>> Herman and Wenchen.
>>>
>>> Matei
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: Welcoming two new committers

2016-02-08 Thread Ram Sriharsha
great job guys! congrats and welcome!

On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan  wrote:

> Welcome.
>
> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> Congratulations Herman and Wenchen!
>>
>> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or  wrote:
>>
>>> Welcome!
>>>
>>> 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra 
>>> :
>>>
 Congratulations to both. and welcome to group.

 On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia  wrote:

> Hi all,
>
> The PMC has recently added two new Spark committers -- Herman van
> Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
> Tungsten, adding new features, optimizations and APIs. Please join me in
> welcoming Herman and Wenchen.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>

>>>
>>
>


-- 
Ram Sriharsha
Architect, Spark and Data Science
Hortonworks, 2550 Great America Way, 2nd Floor
Santa Clara, CA 95054
Ph: 408-510-8635
email: har...@apache.org

https://www.linkedin.com/in/harsha340



Re: Welcoming two new committers

2016-02-08 Thread Andrew Or
Welcome!

2016-02-08 10:55 GMT-08:00 Bhupendra Mishra :

> Congratulations to both. and welcome to group.
>
> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van Hovell
>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>> adding new features, optimizations and APIs. Please join me in welcoming
>> Herman and Wenchen.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Welcoming two new committers

2016-02-08 Thread Suresh Thalamati
Congratulations Herman and Wenchen!

On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or  wrote:

> Welcome!
>
> 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra :
>
>> Congratulations to both. and welcome to group.
>>
>> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> The PMC has recently added two new Spark committers -- Herman van Hovell
>>> and Wenchen Fan. Both have been heavily involved in Spark SQL and Tungsten,
>>> adding new features, optimizations and APIs. Please join me in welcoming
>>> Herman and Wenchen.
>>>
>>> Matei
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


Re: Welcoming two new committers

2016-02-08 Thread Amit Chavan
Welcome.

On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati  wrote:

> Congratulations Herman and Wenchen!
>
> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or  wrote:
>
>> Welcome!
>>
>> 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra :
>>
>>> Congratulations to both. and welcome to group.
>>>
>>> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia 
>>> wrote:
>>>
 Hi all,

 The PMC has recently added two new Spark committers -- Herman van
 Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
 Tungsten, adding new features, optimizations and APIs. Please join me in
 welcoming Herman and Wenchen.

 Matei


>>>
>>
>


Spark in Production - Use Cases

2016-02-08 Thread Scott walent
Spark Summit East is just 10 days away and we are almost sold out! One of
the highlights this year will focus on how Spark is being used across
businesses to solve both big and small data needs. Check out the full
agenda here: https://spark-summit.org/east-2016/schedule/

Use "ApacheList" for 30% off at registration.

We wanted to highlight a few talks, including keynotes from:
- Chris D'Agostino: Vice President, Digital and US Card Servicing Technology
and Engineering at Capital One
- Matei Zaharia: CTO and Co-founder of Databricks
- Seshu Adunuthula: Head of Analytics Infrastructure at eBay

The keynotes are just the start of the summit. The community submitted over
200 talks and we narrowed it down to 60 to be presented in NYC. Here is
just a sampling:
- Top 5 Mistakes When Writing Spark Applications from Cloudera
- Structuring Spark: DataFrames, Datasets, and Streaming from Databricks
- TopNotch: Systematically Quality Controlling Big Data from BlackRock
- Distributed Time Travel for Feature Generation by Netflix

This will be our only summit on the east coast this year, register today to
guarantee a seat! https://spark-summit.org/east-2016/


Re: Welcoming two new committers

2016-02-08 Thread Joseph Bradley
Congrats & welcome!

On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha 
wrote:

> great job guys! congrats and welcome!
>
> On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan  wrote:
>
>> Welcome.
>>
>> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati <
>> suresh.thalam...@gmail.com> wrote:
>>
>>> Congratulations Herman and Wenchen!
>>>
>>> On Mon, Feb 8, 2016 at 10:59 AM, Andrew Or 
>>> wrote:
>>>
 Welcome!

 2016-02-08 10:55 GMT-08:00 Bhupendra Mishra :

> Congratulations to both. and welcome to group.
>
> On Mon, Feb 8, 2016 at 10:45 PM, Matei Zaharia <
> matei.zaha...@gmail.com> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van
>> Hovell and Wenchen Fan. Both have been heavily involved in Spark SQL and
>> Tungsten, adding new features, optimizations and APIs. Please join me in
>> welcoming Herman and Wenchen.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>

>>>
>>
>
>
> --
> Ram Sriharsha
> Architect, Spark and Data Science
> Hortonworks, 2550 Great America Way, 2nd Floor
> Santa Clara, CA 95054
> Ph: 408-510-8635
> email: har...@apache.org
>
>
>


Re: pyspark worker concurrency

2016-02-08 Thread Renyi Xiong
Never mind; I think PySpark already does asynchronous socket reads and
writes, but on the Scala side, in PythonRDD.scala.

On Sat, Feb 6, 2016 at 6:27 PM, Renyi Xiong  wrote:

> Hi,
>
> Is it a good idea to have two threads in the PySpark worker? The main
> thread would be responsible for receiving and sending data over the socket
> while the other thread calls user functions to process the data.
>
> Since the CPU is idle (?) during network I/O, this should improve
> concurrency quite a bit.
>
> Can an expert answer the question? What are the pros and cons here?
>
> thanks,
> Renyi.
>
>
>
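A minimal single-process sketch of the two-thread idea (pure Python; the `recv`/`send` callables stand in for real socket reads and writes, and a real worker would interleave feeding and draining rather than doing them in two phases as here):

```python
import queue
import threading

def worker_loop(recv, send, fn, n):
    """Sketch: one thread owns 'socket' I/O while another runs the user function."""
    in_q, out_q = queue.Queue(), queue.Queue()

    def compute():
        # Consumer thread: apply the user function to each received item.
        while True:
            item = in_q.get()
            if item is None:          # sentinel: no more input
                out_q.put(None)
                return
            out_q.put(fn(item))

    t = threading.Thread(target=compute)
    t.start()

    # "Main thread": feed all inputs, then drain results. Computation overlaps
    # the (simulated) network reads because it runs on the other thread.
    results = []
    for _ in range(n):
        in_q.put(recv())              # pretend this read came off the socket
    in_q.put(None)
    while True:
        r = out_q.get()
        if r is None:
            break
        results.append(send(r))       # pretend this write went to the socket
    t.join()
    return results

# Simulated socket endpoints
data = iter(range(5))
out = worker_loop(lambda: next(data), lambda r: r, lambda x: x * x, 5)
print(out)  # [0, 1, 4, 9, 16]
```

The trade-offs raised in the thread apply here too: the queues add per-record overhead, and CPython's GIL means the overlap only helps while one thread is blocked on I/O.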