Re: Sorting tuples with byte key and byte value
Hi Supun,

A couple of things with regard to your question. --executor-cores is the number of worker threads per executor; according to your requirement this should be set to 8. repartitionAndSortWithinPartitions is an RDD operation, and RDD operations in Spark are less performant than their DataFrame counterparts in both execution time and memory use. I would rather use the DataFrame sort operation if performance is key.

Regards,
Keith.
http://keith-chapman.com

On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <supun.kamburugam...@gmail.com> wrote:
> Hi all,
>
> We are trying to measure the sorting performance of Spark. We have a 16-node
> cluster with 48 cores and 256GB of RAM in each machine and a 10Gbps network.
>
> Let's say we are running with 128 parallel tasks and each partition
> generates about 1GB of data (128GB total).
>
> We are using the method repartitionAndSortWithinPartitions.
>
> A standalone cluster is used with the following configuration:
>
> SPARK_WORKER_CORES=1
> SPARK_WORKER_MEMORY=16G
> SPARK_WORKER_INSTANCES=8
>
> --executor-memory 16G --executor-cores 1 --num-executors 128
>
> I believe this sets 128 executors to run the job, each having 16GB of
> memory, spread across 16 nodes with 8 executors per node. This
> configuration runs very slowly. The program doesn't use disks to read or
> write data (the data is generated in memory and we don't write to a file
> after sorting).
>
> It seems that even though the data size is small, Spark uses disk for the
> shuffle. We are not sure our configuration is optimal to achieve the best
> performance.
>
> Best,
> Supun.
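For readers following along, Keith's suggestion could be sketched as the following configuration change (the values are illustrative for this particular 16-node cluster and have not been benchmarked):

```
# Hypothetical standalone-cluster settings following Keith's advice:
# one executor per node using all 8 of the cores previously split
# across 8 single-core workers (16 nodes -> 16 executors)
SPARK_WORKER_CORES=8
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_INSTANCES=1

--executor-memory 16G --executor-cores 8 --num-executors 16
```

This keeps the same total memory per node while giving each executor 8 threads, and pairs with Keith's recommendation to sort via the DataFrame API rather than repartitionAndSortWithinPartitions.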
Re: Release Apache Spark 2.4.4 before 3.0.0
Hi, Apache Spark PMC members.

Can we cut Apache Spark 2.4.4 next Monday (22nd July)?

Bests,
Dongjoon.

On Fri, Jul 12, 2019 at 3:18 PM Dongjoon Hyun wrote:
> Thank you, Jacek.
>
> BTW, I added `@private` since we need the PMC's help to make an Apache
> Spark release.
>
> Can I get more feedback from the other PMC members?
>
> Please let me know if you have any concerns (e.g. release date or release
> manager).
>
> As one of the community members, I assume the following (if we are on
> schedule):
>
> - 2.4.4 at the end of July
> - 2.3.4 at the end of August (since 2.3.0 was released at the end of
>   February 2018)
> - 3.0.0 (possibly September?)
> - 3.1.0 (January 2020?)
>
> Bests,
> Dongjoon.
>
> On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski wrote:
>
>> Hi,
>>
>> Thanks Dongjoon Hyun for stepping up as a release manager! Much
>> appreciated.
>>
>> If there's a volunteer to cut a release, I'm always happy to support it.
>>
>> In addition, the more frequent the releases, the better for end users, so
>> they have a choice to upgrade and have all the latest fixes, or to wait.
>> It's their call, not ours (if we kept them waiting).
>>
>> My big 2 yes'es for the release!
>>
>> Jacek
>>
>> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun, wrote:
>>
>>> Hi, All.
>>>
>>> Spark 2.4.3 was released two months ago (8th May).
>>>
>>> As of today (9th July), there are 45 fixes in `branch-2.4`, including
>>> the following correctness or blocker issues.
>>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger does not work for
>>>   decimals not fitting in a long
>>> - SPARK-26045 Error in the Spark 2.4 release package with the
>>>   spark-avro_2.11 dependency
>>> - SPARK-27798 from_avro can modify variables in other rows in local mode
>>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
>>> - SPARK-28308 CalendarInterval sub-second part should be padded before
>>>   parsing
>>>
>>> It would be great if we could have Spark 2.4.4 before we get busier with
>>> 3.0.0. If it's okay, I'd like to volunteer as the 2.4.4 release manager
>>> and roll it next Monday (15th July). What do you think?
>>>
>>> Bests,
>>> Dongjoon.
Spark 2.4 scala 2.12 Regular Expressions Approach
Hi All,

Could you please help me fix the issue below using Spark 2.4 and Scala 2.12? How do we extract multiple values from the given file-name pattern using a Spark/Scala regular expression? Please give me some ideas on the approach below.

object Driver {
  private val filePattern =
    ("xyzabc_source2target_adver" +
     "_1stvalue_([a-zA-Z0-9]+)_2ndvalue_([a-zA-Z0-9]+)_3rdvalue_([a-zA-Z0-9]+)" +
     "_4thvalue_([a-zA-Z0-9]+)_5thvalue_([a-zA-Z0-9]+)_6thvalue_([a-zA-Z0-9]+)" +
     "_7thvalue_([a-zA-Z0-9]+)").r
}

How do I get all 7 values captured by "([a-zA-Z0-9]+)" in the pattern above and assign them to the case-class fields in the processing method below?

def processing(x: Dataset[SomeData]) {
  x.map { e =>
    caseClassSchema(
      field1 = /* 1st captured value */,
      field2 = /* 2nd captured value */,
      field3 = /* 3rd captured value */,
      field4 = /* 4th captured value */,
      field5 = /* 5th captured value */,
      field6 = /* 6th captured value */,
      field7 = /* 7th captured value */
    )
  }
}

Thanks
Anbu

--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
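For what it's worth, the extraction itself is ordinary regex group matching, independent of Spark. Here is a minimal runnable sketch of the same seven-group pattern (shown in Python purely to illustrate what the groups yield; in Scala the equivalent is a pattern match of the form `case filePattern(v1, v2, v3, v4, v5, v6, v7) => ...`). The file name used below is invented for the example:

```python
import re

# Same seven capture groups as in the question's pattern.
file_pattern = re.compile(
    r"xyzabc_source2target_adver"
    r"_1stvalue_([a-zA-Z0-9]+)_2ndvalue_([a-zA-Z0-9]+)_3rdvalue_([a-zA-Z0-9]+)"
    r"_4thvalue_([a-zA-Z0-9]+)_5thvalue_([a-zA-Z0-9]+)_6thvalue_([a-zA-Z0-9]+)"
    r"_7thvalue_([a-zA-Z0-9]+)"
)

# Hypothetical file name matching the pattern.
name = ("xyzabc_source2target_adver_1stvalue_a1_2ndvalue_b2_3rdvalue_c3"
        "_4thvalue_d4_5thvalue_e5_6thvalue_f6_7thvalue_g7")

m = file_pattern.match(name)
fields = m.groups() if m else None
print(fields)  # ('a1', 'b2', 'c3', 'd4', 'e5', 'f6', 'g7')
```

The seven elements of `fields` are what would be assigned, in order, to the case-class fields inside the map.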
Sorting tuples with byte key and byte value
Hi all,

We are trying to measure the sorting performance of Spark. We have a 16-node cluster with 48 cores and 256GB of RAM in each machine and a 10Gbps network.

Let's say we are running with 128 parallel tasks and each partition generates about 1GB of data (128GB total).

We are using the method repartitionAndSortWithinPartitions.

A standalone cluster is used with the following configuration:

SPARK_WORKER_CORES=1
SPARK_WORKER_MEMORY=16G
SPARK_WORKER_INSTANCES=8

--executor-memory 16G --executor-cores 1 --num-executors 128

I believe this sets 128 executors to run the job, each having 16GB of memory, spread across 16 nodes with 8 executors per node. This configuration runs very slowly. The program doesn't use disks to read or write data (the data is generated in memory and we don't write to a file after sorting).

It seems that even though the data size is small, Spark uses disk for the shuffle. We are not sure our configuration is optimal to achieve the best performance.

Best,
Supun.
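For context on the disk usage observed here: Spark's sort-based shuffle always materialises map outputs as files under `spark.local.dir`, regardless of how small the data is. One commonly suggested mitigation (a sketch only; the path below is illustrative and this has not been benchmarked on the cluster in question) is to point that directory at a RAM-backed filesystem:

```
# Hypothetical spark-defaults.conf fragment: keep shuffle files in tmpfs
spark.local.dir          /dev/shm/spark-local
spark.shuffle.compress   true
```

The shuffle still goes through the filesystem API, but the writes never reach a physical disk.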
[PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?
Hi all,

Forgive this naïveté; I'm looking for reassurance from some experts!

In the past we created a tailored Spark library for our organisation, implementing Spark functions in Scala with Python and R "wrappers" on top, but the focus on Scala has alienated our analysts/statisticians/data scientists, and collaboration is important for us (yeah... we're aware that your SDKs are very similar across languages... :/ ).

We'd like to see if we could forgo the Scala facet in order to present the source code in a language more familiar to users and internal contributors. We'd ideally write our functions with PySpark and potentially create a SparkR "wrapper" over the top, leading to the question:

Given a function written with PySpark that accepts a DataFrame parameter, is there a way to invoke this function using a SparkR DataFrame? Is there any reason to pursue this? Is it even possible?

Many thanks,
Danny

For the latest data on the economy and society, consult our website at http://www.ons.gov.uk
*** Please Note: Incoming and outgoing email messages are routinely monitored for compliance with our policy on the use of electronic communications ***
*** Legal Disclaimer: Any views expressed by the sender of this message are not necessarily those of the Office for National Statistics ***
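One pattern sometimes suggested for this situation, sketched below as pseudocode rather than a runnable example: because both PySpark and SparkR DataFrames are thin handles onto the same JVM Dataset, a frame can be handed between the two languages through the session catalog instead of calling Python from R directly. Both snippets assume shells attached to the same Spark session, and the view name `shared_df` and function `my_pyspark_function` are hypothetical:

```
# PySpark side: run the Python logic, then publish the result by name
result_df = my_pyspark_function(input_df)
result_df.createOrReplaceTempView("shared_df")

# SparkR side (same Spark session): pick it back up as a SparkDataFrame
#   shared <- sql("SELECT * FROM shared_df")
```

This avoids cross-language function calls entirely, at the cost of requiring a shared session and an agreed naming convention for the views.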