Re: how to classify column
That's good, thanks.

On 2022/2/12 12:11, Raghavendra Ganesh wrote:
> .withColumn("newColumn", expr(s"case when score > 3 then 'good' else 'bad' end"))
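For completeness, a minimal spark-shell sketch of that answer, assuming a DataFrame with an Int column named "score" as in the original question; the when/otherwise Column API at the end is an equivalent alternative that avoids the SQL string:

scala> import org.apache.spark.sql.functions.{expr, when, col}
scala> val df = Seq(0, 2, 4, 5).toDF("score")
scala> df.withColumn("newColumn", expr("case when score > 3 then 'good' else 'bad' end")).show()
+-----+---------+
|score|newColumn|
+-----+---------+
|    0|      bad|
|    2|      bad|
|    4|     good|
|    5|     good|
+-----+---------+

scala> // equivalent, using the Column API instead of a SQL expression:
scala> df.withColumn("newColumn", when(col("score") > 3, "good").otherwise("bad")).show()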
how to classify column
Hello

I have a column whose value (Int type, a score) ranges from 0 to 5. I want to classify each row: when the score > 3, classify it as "good"; otherwise classify it as "bad". How do I implement that? Could a UDF work, something like this?

scala> implicit class Foo(i: Int) {
     |   def classAs(f: Int => String) = f(i)
     | }
class Foo

scala> 4.classAs { x => if (x > 3) "good" else "bad" }
val res13: String = good

scala> 2.classAs { x => if (x > 3) "good" else "bad" }
val res14: String = bad

Thank you.
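For reference, a sketch of the UDF route this question hints at, assuming a DataFrame df with an Int column "score" (the expr-based answer above avoids the UDF entirely and lets Spark optimize the expression):

scala> import org.apache.spark.sql.functions.{udf, col}
scala> val classify = udf((x: Int) => if (x > 3) "good" else "bad")
scala> df.withColumn("newColumn", classify(col("score"))).show()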
Re: data size exceeds the total ram
Hello list

I have imported the data into Spark and I see disk IO on every node. Memory did not overflow, but this query is quite slow:

>>> df.groupBy("rvid").agg({'rate':'avg','rvid':'count'}).show()

May I ask:
1. Since I have 3 nodes (that is, 3 executors?), are there 3 partitions for each job?
2. Can I increase the number of partitions by hand to improve performance?

Thanks

On 2022/2/11 6:22, frakass wrote:
> On 2022/2/11 6:16, Gourav Sengupta wrote:
>> What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using?
> It's a well-formed CSV file (not compressed) stored in HDFS. Spark 3.2.0. Thanks.
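A sketch of how to inspect and adjust parallelism for this case, shown in spark-shell syntax (pyspark exposes the same names); the counts here are illustrative, not recommendations:

scala> df.rdd.getNumPartitions                               // current partition count of the data
scala> val df2 = df.repartition(24)                          // explicitly change the partition count
scala> spark.conf.set("spark.sql.shuffle.partitions", "24")  // partitions used after shuffles such as groupBy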
Re: data size exceeds the total ram
On 2022/2/11 6:16, Gourav Sengupta wrote:
> What is the source data (is it JSON, CSV, Parquet, etc)? Where are you reading it from (JDBC, file, etc)? What is the compression format (GZ, BZIP, etc)? What is the SPARK version that you are using?

It's a well-formed CSV file (not compressed) stored in HDFS. Spark 3.2.0.

Thanks.
data size exceeds the total ram
Hello

I have three nodes with 128 GB of memory each, 384 GB in total. But the input data is about 1 TB. How can Spark handle this case?

Thanks.
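Spark does not need the whole dataset in memory at once: it processes data partition by partition and spills to disk when memory fills. For data that is explicitly cached, a disk-backed storage level makes this an explicit choice; a minimal sketch, assuming an already-loaded df:

scala> import org.apache.spark.storage.StorageLevel
scala> df.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that don't fit in RAM spill to local disk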
Re: Using Avro file format with SparkSQL
Have you added the dependency to build.sbt? Can you package the source successfully with 'sbt package'?

Regards
frakass

On 2022/2/10 11:25, Karanika, Anna wrote:
> For context, I am invoking spark-submit and adding the arguments --packages org.apache.spark:spark-avro_2.12:3.2.0.
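If the dependency is meant to go into build.sbt rather than being passed via --packages, the matching coordinate would be something like the line below (a sketch; the version should track the Spark version in use):

libraryDependencies += "org.apache.spark" %% "spark-avro" % "3.2.0"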
Re: question on the different way of RDD to dataframe
I think it's better written as:

df1.map { case (w, x, y, z) => columns(w, x, y, z) }

Thanks

On 2022/2/9 12:46, Mich Talebzadeh wrote:
> scala> val df2 = df1.map(p => columns(p(0).toString, p(1).toString, p(2).toString, p(3).toString.toDouble)) // map those columns
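A self-contained spark-shell sketch of the pattern-match variant under discussion; the field names of the columns case class are assumptions, since the quoted thread does not show its definition. Note the case-pattern form applies when the source is a Dataset of tuples, whereas the quoted p(0).toString form indexes into Rows:

scala> case class columns(name: String, colour: String, shape: String, price: Double)
scala> val ds = Seq(("apple", "red", "round", 1.5), ("banana", "yellow", "long", 0.5)).toDS
scala> val df2 = ds.map { case (w, x, y, z) => columns(w, x, y, z) }
scala> df2.printSchema
root
 |-- name: string (nullable = true)
 |-- colour: string (nullable = true)
 |-- shape: string (nullable = true)
 |-- price: double (nullable = false)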
Re: flatMap for dataframe
Is this the Scala syntax? Yes, in Scala I know how to do it by converting the df to a Dataset. How about for pyspark?

Thanks

On 2022/2/9 10:24, oliver dd wrote:
> df.flatMap(row => row.getAs[String]("value").split(" "))
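One DataFrame-native alternative that needs no Dataset conversion is explode over split; a sketch in spark-shell syntax, assuming the "value" column from the question (pyspark exposes the same explode and split functions under pyspark.sql.functions, so the same call works there):

scala> import org.apache.spark.sql.functions.{explode, split, col}
scala> df.select(explode(split(col("value"), " ")).as("word")).show()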
Re: question on the different way of RDD to dataframe
I know that using a case class I can control the data types strictly.

scala> val rdd = sc.parallelize(List(("apple", 1), ("orange", 2)))
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> rdd.toDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)

I can specify the second column as another type, such as Double, via a case class:

scala> case class Fruit(fruit: String, num: Double)
scala> rdd.map { case (x, y) => Fruit(x, y) }.toDF.printSchema
root
 |-- fruit: string (nullable = true)
 |-- num: double (nullable = false)

Thank you.

On 2022/2/8 10:32, Sean Owen wrote:
> It's just a possibly tidier way to represent objects with named, typed fields, in order to specify a DataFrame's contents.
flatMap for dataframe
Hello

For an RDD I can apply the flatMap method:

>>> sc.parallelize(["a few words","ba na ba na"]).flatMap(lambda x: x.split(" ")).collect()
['a', 'few', 'words', 'ba', 'na', 'ba', 'na']

But for a dataframe, how can I flatMap it as above?

>>> df.show()
+----------------+
|           value|
+----------------+
|     a few lines|
|hello world here|
|     ba na ba na|
+----------------+

Thanks
Re: unsubscribe
Please send an empty message to user-unsubscr...@spark.apache.org to unsubscribe yourself from the list.

Thanks

On 2022/1/15 7:04, ALOK KUMAR SINGH wrote:
> unsubscribe
Re: groupMapReduce
OK, thanks. I will check that.

On 2022/1/14 7:09, David Diebold wrote:
> Hello,
> In the RDD API, you must be looking for reduceByKey.
> Cheers
>
> On Fri, 14 Jan 2022 at 11:56, frakass <capitnfrak...@free.fr> wrote:
>> Is there an RDD API similar to Scala's groupMapReduce?
>> https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/
>> Thank you.
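For the record, a sketch of the reduceByKey suggestion next to the plain-Scala groupMapReduce it mirrors (sample data is illustrative; groupMapReduce requires Scala 2.13 collections):

scala> val pairs = Seq(("a", 1), ("b", 2), ("a", 3))
scala> pairs.groupMapReduce(_._1)(_._2)(_ + _)  // plain Scala 2.13: Map(a -> 4, b -> 2)
scala> sc.parallelize(pairs).reduceByKey(_ + _).collect()  // RDD equivalent: Array((a,4), (b,2)), order may vary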
groupMapReduce
Is there an RDD API similar to Scala's groupMapReduce?
https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/

Thank you.
Re: about memory size for loading file
For this case I have 3 partitions, each processing 3.333 GB of data, am I right?

On 2022/1/14 2:20, Sonal Goyal wrote:
> No, it should not. The file would be partitioned and read across each node.
>
> On Fri, 14 Jan 2022 at 11:48 AM, frakass <capitnfrak...@free.fr> wrote:
>> Hello list
>> Given the case I have a file whose size is 10GB. The total cluster RAM is 24GB across three nodes, so the local node has only 8GB. If I load this file into Spark as an RDD via the sc.textFile interface, will this operation run into an "out of memory" issue?
>> Thank you.
>
> --
> Cheers,
> Sonal
> https://github.com/zinggAI/zingg
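Rather than guessing: sc.textFile splits a file by HDFS block (128 MB by default), not by node count, so a 10 GB file normally yields on the order of 80 partitions, not 3. A quick way to check (the path below is hypothetical):

scala> val rdd = sc.textFile("hdfs:///path/to/the/10gb/file")
scala> rdd.getNumPartitions  // reports the actual partition count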
about memory size for loading file
Hello list

Given the case I have a file whose size is 10GB. The total cluster RAM is 24GB across three nodes, so the local node has only 8GB. If I load this file into Spark as an RDD via the sc.textFile interface, will this operation run into an "out of memory" issue?

Thank you.