Re: java.lang.UnsupportedOperationException: Cannot evaluate expression: fun_nm(input[0, string, true])

2016-08-16 Thread Sumit Khanna
This is just the stack trace, but where is it you're calling the UDF? Regards, Sumit On 16-Aug-2016 2:20 pm, "pseudo oduesp" wrote: > hi, > I create new columns with a UDF, and afterwards I try to filter on these columns: > I get this error, why? > > :
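
A minimal Scala sketch of the pattern under discussion, adding a column with a UDF and then filtering on it; the DataFrame contents, column names and the length-based UDF are placeholders, not the poster's actual fun_nm:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    object UdfFilterSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("udf-filter-sketch").getOrCreate()
        import spark.implicits._

        val df = Seq("a", "bb", "ccc").toDF("value")   // placeholder input

        // Stand-in for the user's UDF named in the exception message.
        val funNm = udf((s: String) => s.length)

        val withCol = df.withColumn("len", funNm(col("value")))

        // Filtering on the UDF-derived column; if this step throws, check how the
        // UDF handles nulls and whether it is applied to the expected column type.
        withCol.filter(col("len") > 1).show()
      }
    }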

hdfs persist rollbacks when spark job is killed

2016-08-08 Thread Sumit Khanna
Hello, the use case is as follows: say I am inserting 200K rows via dataframe.write.format("parquet") etc. (a basic write-to-HDFS command), but due to some rhyme or reason my job got killed in the middle of the run, meaning let's say I was only able to insert 100K rows
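
One way to make such a partial write harmless is to stage the output and only publish it after the write finishes, as in this Scala sketch; the paths are placeholders and the rename-based "commit" is an assumption, not a built-in Spark transaction:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    // Write to a staging directory first, then rename into the final location only
    // if the whole job succeeded, so readers never see a half-written dataset.
    def writeAtomically(spark: SparkSession, df: DataFrame,
                        stagingDir: String, finalDir: String): Unit = {
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)

      df.write.mode(SaveMode.Overwrite).format("parquet").save(stagingDir)

      // Only reached if the write above completed without throwing.
      fs.delete(new Path(finalDir), true)
      fs.rename(new Path(stagingDir), new Path(finalDir))
    }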

silence the spark debug logs

2016-08-07 Thread Sumit Khanna
Hello, I don't want to print all the Spark logs, only a few, e.g. just the execution plans etc. How do I silence the Spark debug output? Thanks, Sumit
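
Two common ways to quiet the output, shown as a short Scala sketch; the exact level ("WARN" vs "ERROR") and the per-package overrides are a matter of taste:

    import org.apache.spark.sql.SparkSession

    // Programmatic option: raise the root log level once the context is up.
    val spark = SparkSession.builder().appName("quiet-logs").getOrCreate()
    spark.sparkContext.setLogLevel("WARN")   // or "ERROR"

    // Alternative: copy conf/log4j.properties.template to conf/log4j.properties and set
    //   log4j.rootCategory=WARN, console
    // optionally with per-package overrides such as
    //   log4j.logger.org.apache.spark=WARN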

Re: spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
w.r.t. https://issues.apache.org/jira/browse/SPARK-5236. Also, how do I usually convert something of type DecimalType to int / string / etc.? Thanks, On Sun, Aug 7, 2016 at 10:33 AM, Sumit Khanna <sumit.kha...@askme.in> wrote: > Hi, > > was wondering if we have something
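
At the DataFrame level this usually comes down to a cast; a small Scala sketch, where the "amount" column name is a placeholder:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{IntegerType, StringType}

    def decimalToString(df: DataFrame): DataFrame =
      df.withColumn("amount", col("amount").cast(StringType))

    def decimalToInt(df: DataFrame): DataFrame =
      df.withColumn("amount", col("amount").cast(IntegerType))   // drops the fractional part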

spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
Hi, I was wondering if we have something that takes as an argument a Spark DataFrame type, e.g. DecimalType(12,5), and converts it into the corresponding Hive schema type: Double / Decimal / String? Any ideas? Thanks,
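
One way to do this by hand is a pattern match over DataType; the cases and Hive type names below are a sketch of what such a converter could look like, not an official mapping:

    import org.apache.spark.sql.types._

    def toHiveType(dt: DataType): String = dt match {
      case StringType     => "STRING"
      case IntegerType    => "INT"
      case LongType       => "BIGINT"
      case DoubleType     => "DOUBLE"
      case BooleanType    => "BOOLEAN"
      case TimestampType  => "TIMESTAMP"
      case d: DecimalType => s"DECIMAL(${d.precision},${d.scale})"
      case _              => "STRING"   // fallback for anything unhandled
    }

    // toHiveType(DecimalType(12, 5)) == "DECIMAL(12,5)"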

Re: how to debug spark app?

2016-08-03 Thread Sumit Khanna
I am not really sure of the best practices on this, but I either consult localhost:4040/jobs/ etc. or, better, this: val customSparkListener: CustomSparkListener = new CustomSparkListener() sc.addSparkListener(customSparkListener) class CustomSparkListener extends SparkListener { override def
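
A slightly fuller Scala sketch of that listener idea, with the overridden callbacks and log lines chosen purely for illustration:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

    class CustomSparkListener extends SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"Stage ${stage.stageInfo.stageId} completed with ${stage.stageInfo.numTasks} tasks")

      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} ended with result ${jobEnd.jobResult}")
    }

    // Registered as in the snippet above:
    // sc.addSparkListener(new CustomSparkListener())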

Re: multiple spark streaming contexts

2016-08-01 Thread Sumit Khanna
you can use an if..else construction for splitting your messages by > names in foreachRDD: > > lines.foreachRDD((recrdd, time: Time) => { > > recrdd.foreachPartition(part => { > > part.foreach(item_row => { > > if (item_row("table_name"
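
A self-contained Scala sketch of that routing idea; the Map[String, String] record shape and the table names are assumptions for illustration:

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    def route(lines: DStream[Map[String, String]]): Unit = {
      lines.foreachRDD { (recrdd, time: Time) =>
        recrdd.foreachPartition { part =>
          part.foreach { item_row =>
            if (item_row("table_name") == "orders") {
              // handle "orders" rows here
            } else if (item_row("table_name") == "users") {
              // handle "users" rows here
            } else {
              // unknown table: skip or log
            }
          }
        }
      }
    }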

Re: multiple spark streaming contexts

2016-08-01 Thread Sumit Khanna
contexts can all be handled well within a single jar. Guys, please reply. Awaiting. Thanks, Sumit On Mon, Aug 1, 2016 at 12:24 AM, Sumit Khanna <sumit.kha...@askme.in> wrote: > Any ideas on this one, guys? > > I can do a sample run but can't be sure of imminent problems, if any. How

Re: multiple spark streaming contexts

2016-07-31 Thread Sumit Khanna
Any ideas on this one, guys? I can do a sample run but can't be sure of imminent problems, if any. How can I ensure a different batchDuration etc. in here, per StreamingContext? Thanks, On Sun, Jul 31, 2016 at 10:50 AM, Sumit Khanna <sumit.kha...@askme.in> wrote: > Hey, > > Was

multiple spark streaming contexts

2016-07-30 Thread Sumit Khanna
Hey, I was wondering if I could create multiple Spark streaming contexts in my application (e.g. instantiating a worker actor per topic, each with its own streaming context, its own batch duration, everything). What are the caveats, if any? What are the best practices? I have googled half-heartedly on the
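
For background, the Spark Streaming guide states that only one StreamingContext can be active per JVM at a time, and the batch duration is fixed per context. A minimal Scala sketch of the usual single-context alternative, with placeholder socket sources standing in for the real per-topic inputs:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object MultiTopicOneContext {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("multi-topic-one-context")
        val ssc  = new StreamingContext(conf, Seconds(10))   // one batch duration per context

        val topicA = ssc.socketTextStream("host-a", 9001)    // placeholder source for topic A
        val topicB = ssc.socketTextStream("host-b", 9002)    // placeholder source for topic B

        topicA.foreachRDD(rdd => println(s"topic A batch: ${rdd.count()} records"))
        topicB.foreachRDD(rdd => println(s"topic B batch: ${rdd.count()} records"))

        ssc.start()
        ssc.awaitTermination()
      }
    }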

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
s the full > execution plan without the write output. > > That code below looks perfectly normal for writing a parquet file, yes; > there shouldn’t be any tuning needed for “normal” performance. > > Thanks, > Ewan > > *From:* Sumit Khanna [mai

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
ization by any chance or an obsolete hard disk or Intel Celeron may > be? > > Regards, > Gourav Sengupta > > On Fri, Jul 29, 2016 at 7:27 AM, Sumit Khanna <sumit.kha...@askme.in> > wrote: > >> Hey, >> >> master=yarn >> mode=cluster >> >>

Re: how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, so I believe this is the right way to save the file; the optimization is never in the write part but in the head / body of my execution plan, isn't it? Thanks, On Fri, Jul 29, 2016 at 11:57 AM, Sumit Khanna <sumit.kha...@askme.in> wrote: > Hey, > > master=yarn
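
A quick way to look at that head/body before the save is triggered, where df stands for whatever DataFrame feeds the write discussed in this thread:

    df.explain(true)   // prints the parsed, analyzed, optimized and physical plans
    df.write.format("parquet").save("/tmp/out")   // placeholder output path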

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
number of partitions read. > > My advice: give HBase a shot. It gives UPSERT out of the box. If you want > history, just add a timestamp to the key (in reverse). Computation engines > easily support HBase. > > Best > Ayan > > On Fri, Jul 29, 2016 at 5:03 PM, Sumit Khanna <sumit.k

Re: correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Just a note: I had the delta_df keys for the filter (as in the NOT-INTERSECTION UDF) broadcast to all the worker nodes, which I think is an efficient enough move. Thanks, On Fri, Jul 29, 2016 at 12:19 PM, Sumit Khanna <sumit.kha...@askme.in> wrote: > Hey, > > the very first run
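
A Scala sketch of that broadcast-keys filter; the key column name "pk" and the string key type are assumptions:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{col, udf}

    // Collect the delta keys to the driver, broadcast them, and keep only existing
    // rows whose key is NOT in the delta (the "NOT INTERSECTION" filter).
    def antiFilter(spark: SparkSession, existing: DataFrame, delta: DataFrame): DataFrame = {
      val keys  = delta.select("pk").distinct().collect().map(_.getString(0)).toSet
      val bKeys = spark.sparkContext.broadcast(keys)

      val notInDelta = udf((k: String) => !bKeys.value.contains(k))
      existing.filter(notInDelta(col("pk")))
    }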

correct / efficient manner to upsert / update in hdfs (via spark / in general)

2016-07-29 Thread Sumit Khanna
Hey, the very first run: glossary: delta_df := the current run's / execution's changes dataframe. def deduplicate: apply a windowing function and group by. def partitionDataframe(delta_df): get the unique keys of that dataframe and then return an array of dataframes, each containing just that very same
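
A Scala sketch of the two helpers named in the glossary; the key column "pk" and the ordering column "updated_at" are assumptions about the schema:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Keep only the latest row per key using a window function.
    def deduplicate(delta_df: DataFrame): DataFrame = {
      val w = Window.partitionBy("pk").orderBy(col("updated_at").desc)
      delta_df
        .withColumn("rn", row_number().over(w))
        .filter(col("rn") === 1)
        .drop("rn")
    }

    // One dataframe per distinct key value, as described above.
    def partitionDataframe(delta_df: DataFrame): Array[DataFrame] = {
      val keys = delta_df.select("pk").distinct().collect().map(_.getString(0))
      keys.map(k => delta_df.filter(col("pk") === k))
    }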

how to save spark files as parquets efficiently

2016-07-29 Thread Sumit Khanna
Hey, master=yarn mode=cluster spark.executor.memory=8g spark.rpc.netty.dispatcher.numThreads=2 All the POC is on a single-node cluster. The biggest bottleneck: 1.8 hrs to save 500k records as a parquet file/dir executing this command:
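
Since the save call is what triggers the whole upstream lineage, the 1.8 hrs usually belongs to the transformations feeding it rather than the parquet writer itself (as the replies above also suggest). A rough Scala sketch of the write shape, where someExpensiveTransformations, the partition count and the output path are all placeholders:

    val result = someExpensiveTransformations(inputDf)   // hypothetical upstream work

    result.cache()                  // materialize once if the result is reused afterwards
    result
      .coalesce(4)                  // single-node POC: avoid spraying many tiny files
      .write
      .mode("overwrite")
      .parquet("/data/poc/output")  // placeholder path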