How to partition a SparkDataFrame using all distinct column values in SparkR

2016-07-25 Thread Neil Chang
Hi,
  This is a question regarding SparkR in Spark 2.0.

Given a SparkDataFrame, I want to partition it using one
column's values. Each value corresponds to a partition: all rows that
have the same column value should go to the same partition, no more, no
less.

   It seems the function repartition() doesn't do this. I have 394 unique
values, but it just partitions my DataFrame into 200 partitions. If I set
numPartitions to 394, the partitions still don't match the distinct values.
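
For illustration, a rough sketch of what I'm attempting, assuming the Spark 2.0
SparkR API, with df standing for the SparkDataFrame and "key" as a placeholder
column name (not my real column):

  # assumes an active SparkR session: library(SparkR); sparkR.session()
  # Repartitioning by a column hash-partitions the rows into
  # spark.sql.shuffle.partitions (200 by default), so the 394 distinct
  # values do not map one-to-one onto partitions.
  df2 <- repartition(df, col = df$key)

  # Asking for 394 partitions is still hash-based: several values can land
  # in one partition while other partitions stay empty.
  df3 <- repartition(df, numPartitions = 394L, col = df$key)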

Is it possible to do what I described in SparkR?
groupBy doesn't work with a UDF at all.

Or can we split the DataFrame into a list of smaller ones first? If so, what
can I use?

Thanks,
Neil


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-23 Thread Neil Chang
One example of using dapply is to apply linear regression on many small
partitions.
I think plain R can do that with parallelism too, but I heard dapply is faster.
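
A rough sketch of that per-partition regression, assuming Spark 2.0 SparkR and a
hypothetical SparkDataFrame df with numeric columns x and y:

  # dapply() hands each partition to the function as a plain R data.frame,
  # so an ordinary lm() fit runs independently on every partition.
  schema <- structType(structField("intercept", "double"),
                       structField("slope", "double"))
  fits <- dapply(df, function(part) {
    m <- lm(y ~ x, data = part)
    data.frame(intercept = unname(coef(m)[1]), slope = unname(coef(m)[2]))
  }, schema)
  head(collect(fits))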

On Friday, July 22, 2016, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:

> I haven't used SparkR/R before, only the Scala/Python APIs, so I don't know
> for sure.
>
> I am guessing that if things are in a DataFrame they were read either from some
> disk source (S3/HDFS/file/etc) or they were created from parallelize. If
> you are using the first, Spark will for the most part choose a reasonable
> number of partitions, while for parallelize I think it depends on what your
> min parallelism is set to.
>
> In my brief Google search it looks like dapply is an analogue of mapPartitions.
> Usually the reason to use this is if your map operation has some expensive
> initialization step. For example, if you need to open a connection to a
> database, it's better to re-use that connection for one partition's
> elements than to create it for each element (a sketch of this pattern follows
> after the quoted thread below).
>
> What are you trying to accomplish with dapply?
>
> On Fri, Jul 22, 2016 at 8:05 PM, Neil Chang <iam...@gmail.com> wrote:
>
>> Thanks Pedro,
>>   so to use SparkR dapply on a SparkDataFrame, don't we need to partition the
>> DataFrame first? The example in the doc doesn't seem to do this.
>> Without knowing how it is partitioned, how can one write the function to
>> process each partition?
>>
>> Neil
>>
>> On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>>
>>> This should work and I don't think it triggers any actions:
>>>
>>> df.rdd.partitions.length
>>>
>>> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote:
>>>
>>>> It seems no function does this in the Spark 2.0 preview?
>>>>
>>>
>>>
>>>
>>> --
>>> Pedro Rodriguez
>>> PhD Student in Distributed Machine Learning | CU Boulder
>>> UC Berkeley AMPLab Alumni
>>>
>>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>>> Github: github.com/EntilZha | LinkedIn:
>>> https://www.linkedin.com/in/pedrorodriguezscience
>>>
>>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
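
A minimal sketch of the "expensive initialization once per partition" pattern
mentioned above, again assuming Spark 2.0 SparkR; the DBI/RSQLite calls, table
name, and path are placeholders, not anything from this thread:

  # Open one database connection per partition and reuse it for every row in
  # that partition, instead of connecting once per row.
  schema <- structType(structField("rows_written", "integer"))
  written <- dapply(df, function(part) {
    con <- DBI::dbConnect(RSQLite::SQLite(), "/tmp/example.db")
    DBI::dbWriteTable(con, "events", part, append = TRUE)
    DBI::dbDisconnect(con)
    data.frame(rows_written = nrow(part))
  }, schema)
  collect(written)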


Re: How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
Thanks Pedro,
  so to use SparkR dapply on a SparkDataFrame, don't we need to partition the
DataFrame first? The example in the doc doesn't seem to do this.
Without knowing how it is partitioned, how can one write the function to
process each partition?

Neil

On Fri, Jul 22, 2016 at 5:56 PM, Pedro Rodriguez <ski.rodrig...@gmail.com>
wrote:

> This should work and I don't think it triggers any actions:
>
> df.rdd.partitions.length
>
> On Fri, Jul 22, 2016 at 2:20 PM, Neil Chang <iam...@gmail.com> wrote:
>
>> It seems no function does this in the Spark 2.0 preview?
>>
>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>


How to get the number of partitions for a SparkDataFrame in Spark 2.0-preview?

2016-07-22 Thread Neil Chang
It seems no function does this in the Spark 2.0 preview?


Re: spark worker continuously trying to connect to master and failing in standalone mode

2016-07-22 Thread Neil Chang
Thank you guys, it is the port issue.

On Wed, Jul 20, 2016 at 11:03 AM, Igor Berman <igor.ber...@gmail.com> wrote:

> In addition, check what IP the master is binding to (with netstat).
>
> On 20 July 2016 at 06:12, Andrew Ehrlich <and...@aehrlich.com> wrote:
>
>> Troubleshooting steps:
>>
>> $ telnet localhost 7077 (on the master, to confirm the port is open)
>> $ telnet <master-host> 7077 (on the slave, to check whether the port is reachable)
>>
>> If the port is reachable from the master itself but not from the slave,
>> check the firewall settings on the master:
>> https://help.ubuntu.com/lts/serverguide/firewall.html
>>
>> On Jul 19, 2016, at 6:25 PM, Neil Chang <iam...@gmail.com> wrote:
>>
>> Hi,
>>   I have two virtual PCs on a private cloud (Ubuntu 14). I installed the
>> Spark 2.0 preview on both machines. I then tried to test it in standalone mode.
>> I have no problem starting the master. However, when I start the worker
>> (slave) on the other machine, it makes many attempts to connect to the master
>> and fails in the end.
>>   I can ssh from each machine to the other without any problem. I can also
>> run a master and a worker on the same machine without any problem.
>>
>> What did I miss? Any clue?
>>
>> Here are the messages:
>>
>> WARN NativeCodeLoader: Unable to load native-hadoop library for your
>> platform ... using builtin-java classes where applicable
>> ..
>> INFO Worker: Connecting to master ip:7077 ...
>> INFO Worker: Retrying connection to master (attempt #1)
>> ..
>> INFO Worker: Retrying connection to master (attempt #7)
>> java.lang.IllegalArgumentException: requirement failed: TransportClient
>> has not yet been set.
>>at scala.Predef$.require(Predef.scala:224)
>> ...
>> WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to
>> ip:7077 timed out
>> WARN Worker: Failed to connect to master ip:7077
>>
>>
>>
>> Thanks,
>> Neil
>>
>>
>>
>


spark worker continuously trying to connect to master and failing in standalone mode

2016-07-19 Thread Neil Chang
Hi,
  I have two virtual PCs on a private cloud (Ubuntu 14). I installed the Spark
2.0 preview on both machines. I then tried to test it in standalone mode.
I have no problem starting the master. However, when I start the worker
(slave) on the other machine, it makes many attempts to connect to the master and
fails in the end.
  I can ssh from each machine to the other without any problem. I can also
run a master and a worker on the same machine without any problem.

What did I miss? Any clue?

Here are the messages:

WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform ... using builtin-java classes where applicable
..
INFO Worker: Connecting to master ip:7077 ...
INFO Worker: Retrying connection to master (attempt #1)
..
INFO Worker: Retrying connection to master (attempt #7)
java.lang.IllegalArgumentException: requirement failed: TransportClient has
not yet been set.
   at scala.Predef$.require(Predef.scala:224)
...
WARN NettyRpcEnv: Ignored failure: java.io.IOException: Connecting to
ip:7077 timed out
WARN Worker: Failed to connect to master ip:7077



Thanks,
Neil