Re: splitting columns into new columns

2017-07-17 Thread Pralabh Kumar
=schema.substring(0,schema.length-1) val sqlSchema = StructType(schema.split(",").map(s=>StructField(s,StringType,false))) sqlContext.createDataFrame(newDataSet,sqlSchema).show() Regards Pralabh Kumar On Mon, Jul 17, 2017 at 1:55 PM, nayan sharma <nayansharm...@gmail.com>
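A self-contained sketch of the snippet's technique, building a StructType from a comma-separated schema string (field names and rows are illustrative; the snippet's sqlContext is replaced by a SparkSession here):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().appName("split-columns").master("local[*]").getOrCreate()

    // Comma-separated field names, e.g. parsed from a header line.
    val schema = "name,city,country"

    // One non-nullable String field per name, as in the snippet above.
    val sqlSchema = StructType(schema.split(",").map(s => StructField(s, StringType, false)))

    // Rows whose values line up with the schema fields.
    val newDataSet = spark.sparkContext.parallelize(Seq(
      Row("alice", "pune", "india"),
      Row("bob", "london", "uk")))

    spark.createDataFrame(newDataSet, sqlSchema).show()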

Re: Reading Hive tables Parallel in Spark

2017-07-17 Thread Pralabh Kumar
Run the Spark context in a multithreaded way. Something like this: val spark = SparkSession.builder() .appName("practice") .config("spark.scheduler.mode","FAIR") .enableHiveSupport().getOrCreate() val sc = spark.sparkContext val hc = spark.sqlContext val thread1 = new Thread {
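A fuller sketch of the threaded pattern the snippet starts (table names are hypothetical; assumes a reachable Hive metastore). FAIR scheduling lets the jobs submitted from the two threads share the cluster instead of queueing FIFO:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("practice")
      .config("spark.scheduler.mode", "FAIR")
      .enableHiveSupport()
      .getOrCreate()

    // Each thread submits its own job against the shared session.
    val thread1 = new Thread {
      override def run(): Unit = spark.sql("SELECT COUNT(*) FROM db.table_a").show()
    }
    val thread2 = new Thread {
      override def run(): Unit = spark.sql("SELECT COUNT(*) FROM db.table_b").show()
    }
    thread1.start(); thread2.start()
    thread1.join(); thread2.join()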

Re: Withcolumn date with sysdate

2017-06-30 Thread Pralabh Kumar
Put the default value inside lit: df.withColumn("date", lit("constant value")) On Fri, Jun 30, 2017 at 10:20 PM, sudhir k wrote: > Can we add a column to a dataframe with a default value like sysdate? I am > calling my udf but it is throwing the error "col expected". > > On spark
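For a sysdate-style default rather than a fixed constant, current_date()/current_timestamp() can be used in place of the literal (a sketch; data and column names are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{current_date, lit}

    val spark = SparkSession.builder().appName("defaults").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b").toDF("col1")

    // current_date() gives a sysdate-like value, lit() a fixed constant.
    df.withColumn("date", current_date())
      .withColumn("source", lit("constant value"))
      .show()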

Re: GC overhead exceeded

2017-08-17 Thread Pralabh Kumar
What is your executor memory? Please share the code as well. On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > > Hi, > > I am getting the below error when running Spark SQL jobs. The error is thrown > after running 80% of the tasks. Any solution? > >
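Executor memory matters here because "GC overhead limit exceeded" usually means the heap is too small for the workload. A hedged sketch of the usual first knobs (key names as of Spark 2.3+; values are illustrative, not a universal fix):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("gc-tuning")
      .config("spark.executor.memory", "8g")         // more heap per executor
      .config("spark.executor.memoryOverhead", "2g") // extra off-heap headroom
      .getOrCreate()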

Re: Broadcasts & Storage Memory

2017-06-22 Thread Pralabh Kumar
smaller set of memory used on a given executor for broadcast variables through the UI? Regards Pralabh Kumar On Thu, Jun 22, 2017 at 4:39 AM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Satish, > > I agree - that was my impression too. However I am seeing a smaller set of > s

Re: Re: spark2.1 kafka0.10

2017-06-22 Thread Pralabh Kumar
replicas. > > 2017-06-22 > -- > lk_spark > ------ > > *From:* Pralabh Kumar <pralabhku...@gmail.com> > *Sent:* 2017-06-22 17:23 > *Subject:* Re: spark2.1 kafka0.10 > *To:* "lk_spark"<lk_sp...@163.com> > *Cc

Re: spark2.1 kafka0.10

2017-06-21 Thread Pralabh Kumar
How many replicas do you have for this topic? On Thu, Jun 22, 2017 at 9:19 AM, lk_spark wrote: > java.lang.IllegalStateException: No current assignment for partition > pages-2 > at org.apache.kafka.clients.consumer.internals.SubscriptionState. >

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Pralabh Kumar
. But within one thread, one submit will complete and then the next one will start. If there are independent stages in one job, those will run in parallel. I agree with Bryan Jeffrey. Regards Pralabh Kumar On Tue, Jun 27, 2017 at 9:03 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > I think

Re: Question about Parallel Stages in Spark

2017-06-26 Thread Pralabh Kumar
On Tue, Jun 27, 2017 at 9:17 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > My words caused a misunderstanding. > Step 1: A is submitted to Spark. > Step 2: B is submitted to Spark. > > Spark gets two independent jobs. FAIR is used to schedule A and B. > > Jeffrey's code did not ca

Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
Makes sense :) On Sun, Jun 18, 2017 at 8:38 AM, 颜发才(Yan Facai) <facai@gmail.com> wrote: > Yes, perhaps we could use SQLTransformer as well. > > http://spark.apache.org/docs/latest/ml-features.html#sqltransformer > > On Sun, Jun 18, 2017 at 10:47 AM, Pralabh Kumar

Re: Best alternative for Category Type in Spark Dataframe

2017-06-17 Thread Pralabh Kumar
Hi Yan, yes, SQL is a good option, but if we have to create an ML Pipeline, then having transformers and setting them into pipeline stages would be the better option. Regards Pralabh Kumar On Sun, Jun 18, 2017 at 4:23 AM, 颜发才(Yan Facai) <facai@gmail.com> wrote: > To filter data, how about
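A small sketch of how the two views combine, wrapping SQL logic in a pipeline stage via SQLTransformer (data and statement are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.SQLTransformer
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("sql-in-pipeline").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq("abce", "happy").toDF("col1")

    // __THIS__ is substituted with the input DataFrame at transform time.
    val sqlTrans = new SQLTransformer().setStatement(
      "SELECT *, CASE WHEN col1 = 'happy' THEN 1 ELSE 0 END AS label FROM __THIS__")

    // The SQL step can now sit among other pipeline stages.
    val model = new Pipeline().setStages(Array(sqlTrans)).fit(df)
    model.transform(df).show()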

Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
Sample UDF: val getlength = udf((data: String) => data.length()) data.select(getlength(data("col1"))) On Fri, Jun 16, 2017 at 9:21 AM, lk_spark wrote: > Hi all, > I defined a udf with multiple parameters, but I don't know how to > call it with a DataFrame. > > UDF: > > def ssplit2

Re: Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
> and end index? I tried it but got errors. Can the UDF parameters only > be of column type? > > 2017-06-16 > -- > lk_spark > ------ > > *From:* Pralabh Kumar <pralabhku...@gmail.com> > *Sent:* 2017-06-16 17:49 >

Re: Re: how to call udf with parameters

2017-06-15 Thread Pralabh Kumar
val getlength = udf((idx1: Int, idx2: Int, data: String) => data.substring(idx1, idx2)) data.select(getlength(lit(1), lit(2), data("col1"))).collect On Fri, Jun 16, 2017 at 10:22 AM, Pralabh Kumar <pralabhku...@gmail.com> wrote: > Use lit; give me some time, I'll provide an exam
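Put together, a self-contained version of the pattern (Spark 2.x API; the point is that non-column arguments still have to arrive as columns, hence lit):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{lit, udf}

    val spark = SparkSession.builder().appName("udf-params").master("local[*]").getOrCreate()
    import spark.implicits._

    val data = Seq("abcdef", "spark").toDF("col1")

    // Start/end indexes are wrapped in lit() so they reach the UDF as columns.
    val substrUdf = udf((idx1: Int, idx2: Int, s: String) => s.substring(idx1, idx2))

    data.select(substrUdf(lit(1), lit(2), data("col1"))).show()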

Re: featureSubsetStrategy parameter for GradientBoostedTreesModel

2017-06-15 Thread Pralabh Kumar
level. Jira SPARK-20199 <https://issues.apache.org/jira/browse/SPARK-20199>. Please let me know if my understanding is correct. Regards Pralabh Kumar On Fri, Jun 16, 2017 at 7:53 AM, Pralabh Kumar <pralabhku...@gmail.com> wrote: > Hi everyone > > Currently GBT doesn

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
"abce","happy")).toDF("col1") val trans = new CategoryTransformer("1") data.show() trans.transform(data).show() This transformer will make sure you always have the values in col1 that you provided. Regards Pralabh Kumar On Fri, Jun 16, 2017 at 8:10 PM, S

Re: Best alternative for Category Type in Spark Dataframe

2017-06-16 Thread Pralabh Kumar
Hi Saatvik, can you please provide an example of what exactly you want? On 16-Jun-2017 7:40 PM, "Saatvik Shah" wrote: > Hi Yan, > > Basically the reason I was looking for the categorical datatype is as > given here

Re: (Spark-ml) java.util.NosuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread Pralabh Kumar
into this, and if that's not the case, then could you please share your code and training/testing data for better understanding. Regards Pralabh Kumar On Wed, Jun 28, 2017 at 11:45 AM, neha nihal <nehaniha...@gmail.com> wrote: > > Hi, > > I am using Apache Spark 2.0.2 randomfor

Re: Spark GroupBy Save to different files

2017-09-04 Thread Pralabh Kumar
Hi Arun, rdd1.groupBy(_.city).map(s => (s._1, s._2.toList.toString)).toDF("city","data").write.partitionBy("city").csv("/data") should work for you. Regards Pralabh On Sat, Sep 2, 2017 at 7:58 AM, Ryan wrote: > you may try foreachPartition > > On Fri, Sep 1, 2017 at
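The same idea sketched end to end (case class, data, and output path are illustrative); partitionBy creates one subdirectory per distinct key, e.g. /data/city=pune/:

    import org.apache.spark.sql.SparkSession

    case class Record(city: String, data: String)

    val spark = SparkSession.builder().appName("partition-by").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(Record("pune", "a"), Record("delhi", "b"), Record("pune", "c")).toDF()

    // One directory per city, each holding only that city's rows.
    df.write.partitionBy("city").csv("/data")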

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-02 Thread Pralabh Kumar
ur query is select sum(x), a from t group by a, then try select > sum(partial), a from (select sum(x) as partial, a, b from t group by a, b) > group by a. > > rb > ​ > > On Tue, May 1, 2018 at 4:21 AM, Pralabh Kumar <pralabhku...@gmail.com> > wrote: > >> Hi >>

org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Pralabh Kumar
Hi, I am getting the above error in Spark SQL. I have increased the number of partitions (to 5000) but am still getting the same error. My data is most probably skewed. org.apache.spark.shuffle.FetchFailedException: Too large frame: 4247124829 at
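Frames above 2 GB overflow the shuffle transport's frame size, so the usual mitigations are more shuffle partitions and dealing with the skewed key itself. A hedged config sketch (spark.maxRemoteBlockSizeFetchToMem is the Spark 2.4+ name; values are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("skew-mitigation")
      // More partitions shrink each shuffle block.
      .config("spark.sql.shuffle.partitions", "5000")
      // Stream big remote blocks to disk instead of buffering in memory.
      .config("spark.maxRemoteBlockSizeFetchToMem", "200m")
      .getOrCreate()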

PIG to Spark

2018-01-08 Thread Pralabh Kumar
Hi, is there a convenient way / open-source project to convert Pig scripts to Spark? Regards Pralabh Kumar

Does Spark and Hive use Same SQL parser : ANTLR

2018-01-18 Thread Pralabh Kumar
Hi, do Hive and Spark use the same SQL parser provided by ANTLR? Do they generate the same logical plan? Please help on the same. Regards Pralabh Kumar

Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
Hi, I am performing a broadcast join where my small table is 1 GB. I am getting the following error: org.apache.spark.SparkException: . Available: 0, required: 28869232. To avoid this, increase spark.kryoserializer.buffer.max value. I increased the value to
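A hedged sketch of the relevant settings (spark.kryoserializer.buffer.max is capped just below 2048m; whether broadcasting a 1 GB table is advisable at all also depends on driver and executor memory):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("broadcast-join")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Must hold the largest serialized object; the hard cap is just under 2g.
      .config("spark.kryoserializer.buffer.max", "1g")
      .getOrCreate()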

Re: Kryo serialization failed: Buffer overflow : Broadcast Join

2018-02-02 Thread Pralabh Kumar
I am using spark 2.1.0 On Fri, Feb 2, 2018 at 5:08 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote: > Hi > > I am performing broadcast join where my small table is 1 gb . I am > getting following error . > > I am using > > > org.apache.spark.SparkException: >

Re: Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-08 Thread Pralabh Kumar
table CREATE EXTERNAL TABLE $temp_output ( data String ) STORED BY 'ABCStorageHandler' LOCATION '$table_location' TBLPROPERTIES ( ); when I migrate to Spark it says the STORED BY operation is not permitted. Regards Pralabh Kumar On Thu, Feb 8, 2018
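Spark SQL has no storage-handler hook, but when the handler's behavior maps onto a data source, the USING clause is a common stand-in (a hedged sketch; the format and path are illustrative and are not an equivalent of ABCStorageHandler):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stored-by-alt").enableHiveSupport().getOrCreate()

    // Hypothetical table and location; USING picks a Spark data source
    // instead of a Hive storage handler.
    spark.sql("""
      CREATE TABLE temp_output (data STRING)
      USING parquet
      LOCATION '/path/to/table_location'
    """)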

Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-07 Thread Pralabh Kumar
Hi, Spark 2.0 doesn't support STORED BY. Is there any alternative to achieve the same?

Best way to Hive to Spark migration

2018-04-04 Thread Pralabh Kumar
Hi Spark group, what's the best way to migrate Hive to Spark? 1) Use the HiveContext of Spark 2) Use Hive on Spark ( https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started ) 3) Migrate Hive to Calcite to Spark SQL Regards

Unable to pickle pySpark PipelineModel

2020-12-10 Thread Pralabh Kumar
Hi Dev/User, I want to store Spark ML models in a database so that I can reuse them later. I am unable to pickle them. However, using Scala I am able to convert them into a byte-array stream. So, e.g., I am able to do something like the below in Scala but not in Python: val modelToByteArray
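The Scala snippet is cut off, so the author's modelToByteArray is not shown; one hedged way to get a byte array in Scala is plain Java serialization, relying on Spark ML transformers being java.io.Serializable (a sketch only):

    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    import org.apache.spark.ml.PipelineModel

    // `model` would come from pipeline.fit(trainingData).
    def modelToBytes(model: PipelineModel): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(model) // standard Java serialization
      oos.close()
      bos.toByteArray        // store this blob in a database column
    }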

Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar
Please guide me on which option to go for. I am personally inclined toward option 2; it also allows the use of the latest Spark. Please help me on the same, as there are not many comparisons available online that keep Spark 3.0 in perspective. Regards Pralabh Kumar

Re: Hive on Spark vs Spark on Hive(HiveContext)

2021-07-01 Thread Pralabh Kumar

Spark Thriftserver is failing for when submitting command from beeline

2021-08-20 Thread Pralabh Kumar
abase(Hive.java:1556) at org.apache.hadoop.hive.ql.metadata.Hive.databaseExists(Hive.java:1545) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$databaseExists$1(HiveClientImpl.scala:384) My guess is that authorization through the proxy is not working. Please help. Regards Pralabh Kumar

ivy unit test case failing for Spark

2021-12-21 Thread Pralabh Kumar
3) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) Regards Pralabh Kumar

Log4j 1.2.17 spark CVE

2021-12-12 Thread Pralabh Kumar
Hi developers, users, Spark is built using Log4j 1.2.17. Is there a plan to upgrade based on the recently detected CVE? Regards Pralabh Kumar

Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-10 Thread Pralabh Kumar
to prefix with hdfs to create the DB on HDFS. Why is there a difference in behavior? Can you please point me to the JIRA which caused this change? Note: spark.sql.warehouse.dir and hive.metastore.warehouse.dir both have default values (not explicitly set). Regards Pralabh Kumar

Skip single integration test case in Spark on K8s

2022-03-16 Thread Pralabh Kumar
m successfully able to run some test cases while some are failing. For e.g., "Run SparkRemoteFileTest using a Remote data file" in KubernetesSuite is failing. Is there a way to skip running some of the test cases? Please help me on the same. Regards Pralabh Kumar

Spark on K8s , some applications ended ungracefully

2022-03-31 Thread Pralabh Kumar
) at org.apache.spark.util.ThreadUtils$.shutdown(ThreadUtils.scala:348) Please let me know if there is a solution for it. Regards Pralabh Kumar

Spark on K8s : property similar to yarn.max.application.attempt

2022-02-04 Thread Pralabh Kumar
machine. Is there a way to do the same? Regards Pralabh Kumar

Re: Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue

2022-01-18 Thread Pralabh Kumar
Does the property spark.kubernetes.executor.deleteOnTermination check whether the executor being deleted has shuffle data or not? On Tue, 18 Jan 2022, 11:20 Pralabh Kumar, wrote: > Hi Spark team > > Have cluster-wide property spark.kubernetes.executor.deleteOnTermination

Spark on k8s : spark 3.0.1 spark.kubernetes.executor.deleteontermination issue

2022-01-17 Thread Pralabh Kumar
Hi Spark team, we have the cluster-wide property spark.kubernetes.executor.deleteOnTermination set to true. During a long-running job, some of the executors that held shuffle data got deleted. Because of this, in the subsequent stage we get a lot of shuffle fetch-failed exceptions. Please let me
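For reference, a sketch of the property with its camelCase spelling; keeping terminated executor pods around is a debugging aid for shuffle-data loss rather than a fix:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("k8s-debug")
      // Keep executor pods after termination so their state can be inspected.
      .config("spark.kubernetes.executor.deleteOnTermination", "false")
      .getOrCreate()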

Spark 3.0.1 and spark 3.2 compatibility

2022-04-07 Thread Pralabh Kumar
Hi Spark community, I have a quick question. I am planning to migrate from Spark 3.0.1 to Spark 3.2. Do I need to recompile my application with 3.2 dependencies, or will an application compiled with 3.0.1 work fine on 3.2? Regards Pralabh Kumar

Spark3.2 on K8s with proxy-user

2022-04-21 Thread Pralabh Kumar
Hi, running Spark 3.2 on K8s with --proxy-user, I am getting the below error and then the job fails. However, when running without a proxy user, the job runs fine. Can anyone please help me with the same? 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server :

Re: Spark3.2 on K8s with proxy-user

2022-04-21 Thread Pralabh Kumar
Further information: I have a Kerberized cluster and am also doing the kinit. The problem only occurs when the proxy user is being used. On Fri, Apr 22, 2022 at 10:21 AM Pralabh Kumar wrote: > Hi > > Running Spark 3.2 on K8s with --proxy-user and getting below error and > then t

Re: Driver takes long time to finish once job ends

2022-11-22 Thread Pralabh Kumar
How many cores are you running the driver with? On Tue, 22 Nov 2022, 21:00 Nikhil Goyal, wrote: > Hi folks, > We are running a job on our on-prem cluster on K8s but writing the output > to S3. We noticed that all the executors finish in < 1h but the driver > takes another 5h to finish. Logs: > >

Re: Driver takes long time to finish once job ends

2022-11-22 Thread Pralabh Kumar
What are the cores and memory settings of the driver? On Wed, 23 Nov 2022, 12:56 Pralabh Kumar, wrote: > How many cores are you running the driver with? > > On Tue, 22 Nov 2022, 21:00 Nikhil Goyal, wrote: > >> Hi folks, >> We are running a job on our on-prem cluster on K8s but writing

Spark3.3 with parquet 1.10.x

2023-07-24 Thread Pralabh Kumar
3.3 with Parquet 1.10? What are the dos/don'ts for it? Regards Pralabh Kumar