Re: Sorting tuples with byte key and byte value

2019-07-15 Thread Keith Chapman
Hi Supun,

A couple of things with regard to your question.

--executor-cores means the number of worker threads per executor JVM. According
to your requirement, this should be set to 8.

*repartitionAndSortWithinPartitions* is an RDD operation, and RDD operations in
Spark are not performant in terms of either execution or memory. I would
rather use the Dataframe sort operation if performance is key.
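
For illustration, here is a rough sketch of the Dataframe route (not a drop-in
replacement for your benchmark; it assumes the generated byte pairs are
available as a Dataset, and dataframeSort is just a placeholder name):

import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col

// Shuffle into 128 partitions by key, then sort each partition with the
// Tungsten sorter instead of the RDD-level repartitionAndSortWithinPartitions.
def dataframeSort(pairs: Dataset[(Array[Byte], Array[Byte])]): DataFrame =
  pairs.toDF("key", "value")
    .repartition(128, col("key"))
    .sortWithinPartitions(col("key"))

If you need a total order across partitions rather than a per-partition sort,
orderBy("key") on the Dataframe is the equivalent there.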

Regards,
Keith.

http://keith-chapman.com


On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
supun.kamburugam...@gmail.com> wrote:

> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a 16
> node cluster with 48 cores and 256GB of ram in each machine and 10Gbps
> network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (total 128GB).
>
> We are using the method *repartitionAndSortWithinPartitions*
>
> A standalone cluster is used with the following configuration.
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets up 128 executors to run the job, each having 16GB of
> memory, spread across 16 nodes with 8 threads on each node. This
> configuration runs very slowly. The program doesn't use disks to read or
> write data (the data is generated in memory and we don't write to a file
> after sorting).
>
> It seems that even though the data size is small, the shuffle still uses the
> disk. We are not sure our configuration is optimal to achieve the best
> performance.
>
> Best,
> Supun..
>
>


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-15 Thread Dongjoon Hyun
Hi, Apache Spark PMC members.

Can we cut Apache Spark 2.4.4 next Monday (22nd July)?

Bests,
Dongjoon.


On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun 
wrote:

> Thank you, Jacek.
>
> BTW, I added `@private` since we need PMC's help to make an Apache Spark
> release.
>
> Can I get more feedback from the other PMC members?
>
> Please let me know if you have any concerns (e.g., release date or release
> manager).
>
> As one of the community members, I assumed the following (if we are on
> schedule).
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
> February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:
>
>> Hi,
>>
>> Thanks Dongjoon Hyun for stepping up as a release manager!
>> Much appreciated.
>>
>> If there's a volunteer to cut a release, I'm always happy to support it.
>>
>> In addition, the more frequent the releases, the better for end users: they
>> get the choice to upgrade and pick up all the latest fixes, or to wait. It's
>> their call, not ours (otherwise we'd be keeping them waiting).
>>
>> My big 2 yes'es for the release!
>>
>> Jacek
>>
>>
>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
>>
>>> Hi, All.
>>>
>>> Spark 2.4.3 was released two months ago (8th May).
>>>
>>> As of today (9th July), there are 45 fixes in `branch-2.4`, including
>>> the following correctness or blocker issues.
>>>
>>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>>> decimals not fitting in long
>>> - SPARK-26045 Error in the spark 2.4 release package with the
>>> spark-avro_2.11 dependency
>>> - SPARK-27798 from_avro can modify variables in other rows in local
>>> mode
>>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist
>>> entries
>>> - SPARK-28308 CalendarInterval sub-second part should be padded
>>> before parsing
>>>
>>> It would be great if we could have Spark 2.4.4 before we get busier
>>> with 3.0.0.
>>> If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
>>> it next Monday (15th July).
>>> What do you think about this?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>


Spark 2.4 scala 2.12 Regular Expressions Approach

2019-07-15 Thread anbutech
Hi All,

Could you please help me fix the issue below using Spark 2.4 and Scala 2.12?

How do we extract the multiple values from the given file name pattern using a
Spark/Scala regular expression? Please give me some ideas on the approach below.

object Driver {

  private val filePattern =
    ("xyzabc_source2target_adver_1stvalue_([a-zA-Z0-9]+)_2ndvalue_([a-zA-Z0-9]+)" +
     "_3rdvalue_([a-zA-Z0-9]+)_4thvalue_([a-zA-Z0-9]+)_5thvalue_([a-zA-Z0-9]+)" +
     "_6thvalue_([a-zA-Z0-9]+)_7thvalue_([a-zA-Z0-9]+)").r

How can I get all 7 captured values ("([a-zA-Z0-9]+)") from the above regular
expression pattern using Spark/Scala and assign them to the processing method
below, i.e. to the case class schema fields?

def processing(x: Dataset[someData]) {

  x.map { e =>

    caseClassSchema(
      Field1 = 1stvalue,
      Field2 = 2ndvalue,
      Field3 = 3rdvalue,
      Field4 = 4thvalue,
      Field5 = 5thvalue,
      Field6 = 6thvalue,
      Field7 = 7thvalue
    )
  }
}
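
One rough sketch I have been considering is below (CaseClassSchema and the
field names are only placeholders for my real schema, and I'm not sure whether
this is the idiomatic way):

// Placeholder case class; the real schema has different field names.
case class CaseClassSchema(field1: String, field2: String, field3: String,
                           field4: String, field5: String, field6: String,
                           field7: String)

// Uses the filePattern Regex defined above in object Driver.
// findFirstMatchIn returns None when a file name does not match the pattern,
// so bad names can be filtered out instead of throwing.
def parseFileName(fileName: String): Option[CaseClassSchema] =
  filePattern.findFirstMatchIn(fileName).map { m =>
    CaseClassSchema(m.group(1), m.group(2), m.group(3), m.group(4),
                    m.group(5), m.group(6), m.group(7))
  }

Is calling something like parseFileName inside processing (and flatMapping over
the Option it returns) the right approach, or is there a better way to bind the
seven capture groups to the case class fields?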


Thanks
Anbu







Sorting tuples with byte key and byte value

2019-07-15 Thread Supun Kamburugamuve
Hi all,

We are trying to measure the sorting performance of Spark. We have a 16
node cluster with 48 cores and 256GB of ram in each machine and 10Gbps
network.

Let's say we are running with 128 parallel tasks and each partition
generates about 1GB of data (total 128GB).

We are using the method *repartitionAndSortWithinPartitions*
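
For reference, a rough sketch of what we do (simplified, not the actual
benchmark code; ByteSortSketch is just a placeholder, and since Array[Byte] has
no default Ordering we supply one for the sort):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

object ByteSortSketch {

  // Lexicographic comparison over the raw (signed) byte values; an Ordering
  // for the key type is required by repartitionAndSortWithinPartitions.
  implicit val byteArrayOrdering: Ordering[Array[Byte]] =
    new Ordering[Array[Byte]] {
      def compare(a: Array[Byte], b: Array[Byte]): Int = {
        var i = 0
        while (i < a.length && i < b.length) {
          val c = java.lang.Byte.compare(a(i), b(i))
          if (c != 0) return c
          i += 1
        }
        a.length - b.length
      }
    }

  // pairs: the (key, value) byte tuples generated in memory.
  // HashPartitioner only scatters the tuples into 128 partitions here; since
  // Array hashCode is identity-based the assignment is arbitrary, which is
  // fine for a per-partition sort.
  def sortPairs(pairs: RDD[(Array[Byte], Array[Byte])]): RDD[(Array[Byte], Array[Byte])] =
    pairs.repartitionAndSortWithinPartitions(new HashPartitioner(128))
}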

A standalone cluster is used with the following configuration.

SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_INSTANCES=8

--executor-memory 16G --executor-cores 1 --num-executors 128

I believe this sets up 128 executors to run the job, each having 16GB of
memory, spread across 16 nodes with 8 threads on each node. This configuration
runs very slowly. The program doesn't use disks to read or write data (the
data is generated in memory and we don't write to a file after sorting).

It seems that even though the data size is small, the shuffle still uses the
disk. We are not sure our configuration is optimal to achieve the best
performance.

Best,
Supun..


[PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?

2019-07-15 Thread Fiske, Danny
Hi all,

Forgive this naïveté; I'm looking for reassurance from some experts!

In the past we created a tailored Spark library for our organisation,
implementing Spark functions in Scala with Python and R "wrappers" on top, but
the focus on Scala has alienated our analysts/statisticians/data scientists, and
collaboration is important to us (yeah... we're aware that your SDKs are very
similar across languages... :/ ). We'd like to see if we could forgo the Scala
facet in order to present the source code in a language more familiar to our
users and internal contributors.

We'd ideally write our functions with PySpark and potentially create a SparkR 
"wrapper" over the top, leading to the question:

Given a function written with PySpark that accepts a DataFrame parameter, is 
there a way to invoke this function using a SparkR DataFrame?

Is there any reason to pursue this? Is it even possible?

Many thanks,

Danny
