Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
(i.e. this list) is for discussions about the development of Spark itself. On Wed, May 15, 2019 at 1:50 PM Chetan Khatri wrote: Can anyone help me, I am confused. :( On Wed, May 15, 2019 at 7:28 PM Chetan Khatri <chetan.opensou...@gmail.com>

Re: Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
Can anyone help me, I am confused. :( On Wed, May 15, 2019 at 7:28 PM Chetan Khatri wrote: Hello Spark Developers, I have a question on a Spark join I am doing. I have full-load data from an RDBMS stored at HDFS, let's say: val historyDF = spark.read.parque

Suggestion on Join Approach with Spark

2019-05-15 Thread Chetan Khatri
Hello Spark Developers, I have a question on a Spark join I am doing. I have full-load data from an RDBMS stored at HDFS, let's say: val historyDF = spark.read.parquet("/home/test/transaction-line-item") and I am getting changed data at a separate HDFS path, let's say: val deltaDF = spark.read
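
A minimal sketch (not from the thread) of one common way to merge such a delta set into the full history, assuming both sides share a primary-key column; the column name "id" and the delta/output paths are hypothetical:

// Keep history rows that have no replacement in the delta, then union the delta in.
val historyDF = spark.read.parquet("/home/test/transaction-line-item")
val deltaDF = spark.read.parquet("/home/test/transaction-line-item-delta") // hypothetical path

val unchangedDF = historyDF.join(deltaDF, Seq("id"), "left_anti") // history rows absent from delta
val mergedDF = unchangedDF.union(deltaDF)                         // add replaced + new rows

mergedDF.write.mode("overwrite").parquet("/home/test/transaction-line-item-merged")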

Re: Need help for Delta.io

2019-05-10 Thread Chetan Khatri
Any thoughts, please? On Fri, May 10, 2019 at 2:22 AM Chetan Khatri wrote: Hello All, I need your help / suggestions. I am using Spark 2.3.1 with the HDP 2.6.1 distribution; I will tell my use case so you see where people are trying to use Delta. My use case
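
For reference, the merge pattern Delta Lake exists for looks roughly like the sketch below, assuming a Delta Lake release with the merge API (0.3+) on a compatible Spark version; note that Delta generally requires Spark 2.4.2 or later, newer than the Spark 2.3.1 mentioned here. The table path and key column are hypothetical:

import io.delta.tables.DeltaTable

// Upsert the changed rows into the Delta table in place.
val history = DeltaTable.forPath(spark, "/data/transaction-line-item") // hypothetical Delta table path
history.as("t")
  .merge(deltaDF.as("d"), "t.id = d.id") // "id" is a hypothetical key column
  .whenMatched.updateAll()
  .whenNotMatched.insertAll()
  .execute()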

How to parallelize JDBC Read in Spark

2018-09-06 Thread Chetan Khatri
Hello Dev Users, I am struggling to parallelize a JDBC read in Spark: it is using only 1-2 tasks to read the data and taking a long time. Ex. val invoiceLineItemDF = ((spark.read.jdbc(url = t360jdbcURL, table = invoiceLineItemQuery, columnName = "INVOICE_LINE_ITEM_ID", lowerBound =
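
The usual way to get more than one task is the partitioned jdbc() overload, which splits the read into numPartitions range queries on a numeric column. A sketch continuing the snippet above; the bounds, partition count, and credentials are illustrative assumptions, not values from the thread:

import java.util.Properties

val connProps = new Properties()
connProps.setProperty("user", "dbuser")       // hypothetical credentials
connProps.setProperty("password", "dbpass")

val invoiceLineItemDF = spark.read.jdbc(
  url = t360jdbcURL,                          // from the thread
  table = invoiceLineItemQuery,               // subquery aliased as a table
  columnName = "INVOICE_LINE_ITEM_ID",        // numeric split column
  lowerBound = 1L,                            // ~min(INVOICE_LINE_ITEM_ID)
  upperBound = 100000000L,                    // ~max(INVOICE_LINE_ITEM_ID)
  numPartitions = 32,                         // 32 parallel range scans
  connectionProperties = connProps)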

Re: Select top (100) percent equivalent in spark

2018-09-05 Thread Chetan Khatri
Sean, thank you. Do you think tempDF.orderBy($"invoice_id".desc).limit(100) can give the same result? I think so. Thanks On Wed, Sep 5, 2018 at 12:58 AM Sean Owen wrote: Sort and take head(n)? On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri wrote:
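
For what it's worth, a short sketch of both spellings; Spark usually plans a sort followed by a small limit as a single ordered-take rather than a full global sort, so either form should behave similarly (an optimizer detail, not guaranteed for every plan):

val top100a = tempDF.orderBy($"invoice_id".desc).limit(100) // stays a distributed DataFrame
val top100b = tempDF.sort($"invoice_id".desc).head(100)     // action: returns Array[Row] to the driver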

Re: Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
ink doing an order and limit would be equivalent after optimizations. On Tue, Sep 4, 2018 at 2:28 PM Sean Owen wrote: Sort and take head(n)? On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri <chetan.opensou...@gmail.com> wrote: Dear Spark dev, is there anything equivalent in Spark?

Select top (100) percent equivalent in spark

2018-09-04 Thread Chetan Khatri
Dear Spark dev, is there anything equivalent in Spark?

Re: Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
= textlogRDD.flatMap { x => x.split("[^A-Za-z']+")}.map { y => y.replaceAll("""\n""", " ")} textMappedRDD.collect() 3. val tempRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200).flatMap(files => file

Reading 20 GB of log files from Directory - Out of Memory Error

2018-08-25 Thread Chetan Khatri
Hello Spark Dev Community, A friend of mine is facing an issue while reading 20 GB of log files from a directory on a cluster. The approaches are as below: 1. This gives an out-of-memory error. val logRDD = sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*") val mappedRDD = logRDD.flatMa
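
A hedged sketch of the usual remedies, assuming the goal is per-line processing: read line-by-line with textFile instead of wholeTextFiles (which must hold each entire file as one record), and avoid collect() on the full data set:

// Line-oriented read: each record is one log line, not one whole file.
val lines = sc.textFile("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val tokens = lines.flatMap(_.split("[^A-Za-z']+"))
tokens.take(20).foreach(println)          // sample a few records instead of collect()-ing 20 GB
tokens.saveAsTextFile("/tmp/log-tokens")  // persist large results; hypothetical output path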

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-15 Thread Chetan Khatri
n.html We will continue adding more there. Feel free to ping me directly in case of questions. Thanks, Jayant On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri wrote: Hello Jayant, Thank you so much for the suggestion.

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-09 Thread Chetan Khatri
Pandas Dataframe for processing and finally write the results back. In the Spark/Scala/Java code, you get an RDD of string, which we convert back to a Dataframe. Feel free to ping me directly in case of questions. Thanks, Jayant On Thu, Jul 5

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem, sure. Thanks for the suggestion. On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote: try .pipe(.py) on RDD Thanks, Prem On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri wrote: Can someone please suggest, thanks On Tue 3 J

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-04 Thread Chetan Khatri
Can someone please suggest, thanks. On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri wrote: Hello Dear Spark User / Dev, I would like to pass a Python user-defined function to a Spark job developed using Scala, and the return value of that function would be returned to the DF / Datas

Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-03 Thread Chetan Khatri
Hello Dear Spark User / Dev, I would like to pass a Python user-defined function to a Spark job developed using Scala, and have the return value of that function returned to the DF / Dataset API. Can someone please guide me on which would be the best approach to do this? The Python function would be mostly transfor
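
A minimal sketch of the RDD.pipe route suggested later in this thread: each partition's records are written to an external script's stdin, one per line, and the script's stdout lines become the resulting RDD[String] (Spark 2.2+ for read.json on a Dataset[String]). The script path and JSON round-trip are illustrative assumptions:

import spark.implicits._

val inputRDD = df.toJSON.rdd                      // ship rows to Python as JSON lines
val piped = inputRDD.pipe("python /path/udf.py")  // hypothetical script reading stdin, writing stdout
val resultDF = spark.read.json(piped.toDS())      // parse the script's output back into a DataFrame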

Re: Spark Writing to parquet directory : java.io.IOException: Disk quota exceeded

2017-11-22 Thread Chetan Khatri
Anybody reply on this? On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri wrote: Hello Spark Users, I am getting the below error when I am trying to write a dataset to a parquet location. I have enough disk space available. Last time I was facing the same kind of error whic

Divide Spark Dataframe to parts by timestamp

2017-11-12 Thread Chetan Khatri
Hello All, I have a Spark Dataframe with timestamps from 2015-10-07 19:36:59 to 2017-01-01 18:53:23. I want to split this Dataframe into 3 parts, and I wrote the below code to split it. Can anyone please confirm whether this is the correct approach? val finalDF1 = sampleDF.where(sampleDF.col("timestamp_col").
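
A sketch of the boundary-based split with hypothetical cut-points; using >= on the lower edge and < on the upper edge keeps the three parts disjoint, so no row lands in two parts:

val c = sampleDF.col("timestamp_col")
val part1 = sampleDF.where(c < "2016-03-01 00:00:00")
val part2 = sampleDF.where(c >= "2016-03-01 00:00:00" && c < "2016-08-01 00:00:00")
val part3 = sampleDF.where(c >= "2016-08-01 00:00:00")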

Re: Joining 3 tables with 17 billions records

2017-11-02 Thread Chetan Khatri
Is this just a one-time thing or something regular? If it is a one-time thing then I would tend more towards putting each table in HDFS (Parquet or ORC) and then joining them. What are the Hive and Spark versions? Best regards On 2. Nov 2017, at 20:57, Chetan Khatr

Joining 3 tables with 17 billions records

2017-11-02 Thread Chetan Khatri
Hello Spark Developers, I have 3 tables that I am reading from HBase and want to join and save to a Hive Parquet external table. Currently my join is failing with a container-failed error. 1. Read table A from HBase with ~17 billion records. 2. Repartition on the primary key of table A
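
A hedged sketch of the stage-then-join idea from the reply above: land each HBase-sourced DataFrame in Parquet first, then join the Parquet copies, so the expensive HBase scans are not redone when join tasks fail. The paths and the key column "pk" are hypothetical:

tableA.write.mode("overwrite").parquet("/staging/tableA") // ~17B rows, scanned from HBase once
tableB.write.mode("overwrite").parquet("/staging/tableB")
tableC.write.mode("overwrite").parquet("/staging/tableC")

val a = spark.read.parquet("/staging/tableA")
val b = spark.read.parquet("/staging/tableB")
val c = spark.read.parquet("/staging/tableC")

a.join(b, Seq("pk")).join(c, Seq("pk"))
  .write.mode("overwrite").parquet("/warehouse/joined")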

Apache Spark Streaming / Spark SQL Job logs

2017-08-30 Thread Chetan Khatri
Hey Spark Dev, Can anyone suggest sample Spark Streaming / Spark SQL job logs to download? I want to play with log analytics. Thanks

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-03 Thread Chetan Khatri
stly most people find this number for their job "experimentally" (e.g. they try a few different things). On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri wrote: Ryan, Thank you for the reply. For 2 TB of data, what should be the value of

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
ill be used for Spark execution, not reserved for whatever is consuming it and causing the OOM. (If Spark's memory is too low, you'll see other problems like spilling too much to disk.) rb On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri wrote:

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Chetan Khatri
Can anyone please guide me with the above issue? On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri wrote: Hello Spark Users, I am reading an HBase table and writing to a Hive managed table, where I applied partitioning by a date column, which worked fine but has generated more num
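
This particular YARN kill usually points at off-heap overhead rather than executor heap. A hedged sketch of the stock knobs with illustrative sizes only; in practice these are normally passed to spark-submit, since executor settings must be in place before the application starts:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("HiveRepartitionJob")                        // hypothetical app name
  .config("spark.executor.memory", "8g")
  .config("spark.yarn.executor.memoryOverhead", "2048") // off-heap headroom in MB (Spark 1.x/2.x name)
  .config("spark.sql.shuffle.partitions", "400")        // more, smaller tasks each need less memory
  .getOrCreate()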

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
I think it will be the same, but let me try that. FYR - https://issues.apache.org/jira/browse/SPARK-19881 On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote: Try running spark.sql("set yourconf=val") On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri wrote: Jo

Re: Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Jörn, both are the same. On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote: Try sparksession.conf().set On 28. Jul 2017, at 12:19, Chetan Khatri wrote: Hey Dev / User, I am working with Spark 2.0.1 and with dynamic partitioning with H

Support Dynamic Partition Inserts params with SET command in Spark 2.0.1

2017-07-28 Thread Chetan Khatri
Hey Dev / User, I am working with Spark 2.0.1 and dynamic partitioning with Hive, facing the below issue: org.apache.hadoop.hive.ql.metadata.HiveException: Number of dynamic partitions created is 1344, which is more than 1000. To solve this try to set hive.exec.max.dynamic.partitions to at least 1
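
A sketch of the two suggestions made in the replies above; per SPARK-19881, linked in this thread, both routes may behave the same, and the limit value below is illustrative:

spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=2000") // above the 1344 partitions being created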

Flatten JSON to multiple columns in Spark

2017-07-17 Thread Chetan Khatri
Hello Spark Devs, Can you please guide me on how to flatten JSON to multiple columns in Spark? Example: Sr No Title ISBN Info 1 Calculus Theory 1234567890 [{"cert":[{ "authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa", "certUUID":"03ea5a1a-5530-4fa3-8871-9d1
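
A hedged sketch of the usual from_json + explode route (Spark 2.2+ can parse JSON arrays with from_json); the schema below covers only the two fields visible in the snippet, and the real Info structure certainly has more:

import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types._

// Declare just the slice of the structure you need, then promote nested fields to columns.
val infoSchema = ArrayType(StructType(Seq(
  StructField("cert", ArrayType(StructType(Seq(
    StructField("authSbmtr", StringType),
    StructField("certUUID", StringType))))))))

val flat = df
  .withColumn("info", explode(from_json(col("Info"), infoSchema)))
  .withColumn("cert", explode(col("info.cert")))
  .select(col("Title"), col("ISBN"), col("cert.authSbmtr"), col("cert.certUUID"))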

Re: Issues: Generate JSON with null values in Spark 2.0.x

2017-03-20 Thread Chetan Khatri
ot in omitted form, like: { "first_name": "Dongjin" } right? - Dongjin On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri wrote: Hello Dev / Users, I am working with PySpark code migration to

Issues: Generate JSON with null values in Spark 2.0.x

2017-03-07 Thread Chetan Khatri
Hello Dev / Users, I am working on a PySpark code migration to Scala. In Python, iterating Spark with a dictionary and generating JSON with nulls is possible with json.dumps(), which will be converted to SparkSQL[Row]; but in Scala, how can we generate JSON with null values as a Dataframe? Thanks.
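
A hedged workaround sketch: since Dataset.toJSON omits null-valued fields, build the JSON strings by hand. This naive version renders every non-null value as a quoted string with no escaping, which is enough to show the shape:

import spark.implicits._

val cols = df.columns
val jsonDS = df.map { row =>
  cols.indices.map { i =>
    // Emit an explicit JSON null instead of dropping the field.
    val v = if (row.isNullAt(i)) "null" else s""""${row.get(i)}""""
    s""""${cols(i)}":$v"""
  }.mkString("{", ",", "}")
}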

Re: Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
github.com/SparkMonitor/varOne https://github.com/groupon/sparklint Chetan Khatri wrote on Thu., 16 Feb 2017 at 06:15: Hello All, What would be the best approaches to monitor Spark performance? Are there any tools for Spark job performance monitoring? Thanks.

Spark Job Performance monitoring approaches

2017-02-15 Thread Chetan Khatri
Hello All, What would be the best approaches to monitor Spark performance? Are there any tools for Spark job performance monitoring? Thanks.

Re: Update Public Documentation - SparkSession instead of SparkContext

2017-02-15 Thread Chetan Khatri
d, Feb 15, 2017, 06:44 Chetan Khatri wrote: Hello Spark Dev Team, I was working with my team and we were mostly confused about why the public documentation is not updated with SparkSession, if SparkSession is the ongoing extension and best practice instead of creating a SparkContext. Thanks.

Update Public Documentation - SparkSession instead of SparkContext

2017-02-14 Thread Chetan Khatri
Hello Spark Dev Team, I was working with my team and we were mostly confused about why the public documentation is not updated with SparkSession, if SparkSession is the ongoing extension and best practice instead of creating a SparkContext. Thanks.

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-29 Thread Chetan Khatri
since. Jacek On 29 Jan 2017 9:24 a.m., "Chetan Khatri" wrote: Hello Spark Users, I am getting an error while saving a Spark Dataframe to a Hive table: Hive 1.2.1, Spark 2.0.0, local environment. Note: Job is getting execut

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
TotalOrderPartitioner (sorts data, producing a large number of region files); import the HFiles into HBase; HBase can merge files if necessary. On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri wrote: @Ted, I don't think so. On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote: Does t

Re: HBaseContext with Spark

2017-01-27 Thread Chetan Khatri
use a Hive EXTERNAL TABLE with STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. Try this to see if your problem can be solved: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration Regards Amrit

Re: HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Yu wrote: Though no HBase release has the hbase-spark module, you can find the backport patch on HBASE-14160 (for Spark 1.6). You can build the hbase-spark module yourself. Cheers On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <chetan.opensou...@gmai

HBaseContext with Spark

2017-01-25 Thread Chetan Khatri
Hello Spark Community Folks, Currently I am using HBase 1.2.4 and Hive 1.2.1; I am looking for bulk load from HBase to Hive. I have seen a couple of good examples at the HBase GitHub repo: https://github.com/apache/hbase/tree/master/hbase-spark If I would like to use HBaseContext with HBase 1.2.4, how

Re: Weird experience Hive with Spark Transformations

2017-01-17 Thread Chetan Khatri
/hive-site.xml /usr/local/spark/conf If you want to use the existing Hive metastore, you need to provide that information to Spark. Bests, Dongjoon. On 2017-01-16 21:36 (-0800), Chetan Khatri wrote: Hello, I have the following

Weird experience Hive with Spark Transformations

2017-01-16 Thread Chetan Khatri
Hello, I have the following services configured and installed successfully: Hadoop 2.7.x, Spark 2.0.x, HBase 1.2.4, Hive 1.2.1. Installation directories: /usr/local/hadoop /usr/local/spark /usr/local/hbase Hive environment variables: #HIVE VARIABLES START export HIVE_HOME=/usr/local/hive expo

Re: About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
chema.struct); stdDf: org.apache.spark.sql.DataFrame = [stid: string, name: string ... 3 more fields] Thanks. On Tue, Jan 17, 2017 at 12:48 AM, Chetan Khatri wrote: Hello Community, I am struggling to save a Dataframe to a Hive table. Versions: Hive 1.2.

About saving DataFrame to Hive 1.2.1 with Spark 2.0.1

2017-01-16 Thread Chetan Khatri
Hello Community, I am struggling to save a Dataframe to a Hive table. Versions: Hive 1.2.1, Spark 2.0.1. Working code: /* @Author: Chetan Khatri Description: This Scala script was written for the HBase-to-Hive module, which reads a table from HBase and dumps it out to Hive
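
A minimal sketch of the usual prerequisites for saveAsTable to reach Hive, assuming stdDf is the DataFrame built above; Hive support must be enabled on the session and hive-site.xml visible to Spark (the table name is hypothetical):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("HBaseToHive")  // hypothetical app name
  .enableHiveSupport()     // requires hive-site.xml on Spark's conf path
  .getOrCreate()

stdDf.write.mode(SaveMode.Overwrite).saveAsTable("default.student") // hypothetical table name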

Re: Error at starting Phoenix shell with HBase

2017-01-15 Thread Chetan Khatri
h. I would check the RegionServer logs -- I'm guessing that it never started correctly or failed. The error message is saying that certain regions in the system were never assigned to a RegionServer, which only happens in exceptional cases. Chetan Khatri wrote

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
Ayan, thanks. Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses! On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote: IMHO you should not "think" HBase in RDBMS terms, but you can use ColumnFilters to filter out new records. On Fri, Jan 6, 2017 at

Re: Approach: Incremental data load from HBASE

2017-01-06 Thread Chetan Khatri
t at Row level. On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Ted Yu, you understood wrong; I said incremental load from HBase to Hive, individually you can say incremental import f

Re: Approach: Incremental data load from HBASE

2017-01-04 Thread Chetan Khatri
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the data into HBase. For your use case, the producer needs to find rows where the flag is 0 or 1. After such rows are obtained, it is up to you how the result of processing is delivered to HBase. Cheers On Wed, De

Re: Dependency Injection and Microservice development with Spark

2017-01-04 Thread Chetan Khatri
tlS, https://freebusy.io/la...@mapflat.com On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri wrote: Hello Community, the current approach I am using for Spark job development is Scala + SBT and an Uber Jar with a yml properties file to pass config

Re: Apache Hive with Spark Configuration

2017-01-04 Thread Chetan Khatri
nd we've found (from having different versions as well) that older versions are mostly compatible. Some things fail occasionally, but we haven't had too many problems running different versions with the same metastore in practice. rb On Wed, Dec 28

Re: Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
, unable to tell from the error what exactly it is. Thanks. On Wed, Dec 28, 2016 at 9:00 PM, Chetan Khatri wrote: Hello Spark Community, I am reading an HBase table from Spark and getting an RDD, but now I want to convert the RDD of Spark Rows to a DF.

Error: at sqlContext.createDataFrame with RDD and Schema

2016-12-28 Thread Chetan Khatri
Hello Spark Community, I am reading an HBase table from Spark and getting an RDD, but now I want to convert the RDD of Spark Rows to a DF. Source code: bin/spark-shell --packages it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf spark.hbase.host=127.0.0.1 import it.nerdamme
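
A hedged sketch of the createDataFrame step itself: the common failure is Rows whose arity or runtime types don't match the schema, so build both together. The column names and the two-field shape are assumptions:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("rowkey", StringType, nullable = true),
  StructField("value", StringType, nullable = true)))

// hbaseRDD: RDD[(String, String)] as returned by the connector (assumed shape)
val rowRDD = hbaseRDD.map { case (k, v) => Row(k, v) }
val df = sqlContext.createDataFrame(rowRDD, schema)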

Apache Hive with Spark Configuration

2016-12-28 Thread Chetan Khatri
Hello Users / Developers, I am using Hive 2.0.1 with MySQL as a metastore; can you tell me which version is most compatible with Spark 2.0.2? Thanks

Re: Negative number of active tasks

2016-12-23 Thread Chetan Khatri
Could you share pseudocode for the same? Cheers! C Khatri. On Fri, Dec 23, 2016 at 4:33 PM, Andy Dang wrote: Hi all, Today I hit a weird bug in Spark 2.0.2 (vanilla Spark) - the executor tab shows a negative number of active tasks. I have about 25 jobs, each with 20k tasks, so the nu

Re: Approach: Incremental data load from HBASE

2016-12-23 Thread Chetan Khatri
After such rows are obtained, it is up to you how the result of processing is delivered to HBase. Cheers On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Ok, sure, will ask. But what would be

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
dy On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Andy, thanks for the reply. If we download all the dependencies to a separate location and link them with the Spark job jar on the Spark cluster, is that the best way to execute

Re: Best Practice for Spark Job Jar Generation

2016-12-23 Thread Chetan Khatri
us). --- Regards, Andy On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Hello Spark Community, For Spark job creation I use SBT Assembly to build an Uber ("Super") Jar and

Dependency Injection and Microservice development with Spark

2016-12-23 Thread Chetan Khatri
Hello Community, the current approach I am using for Spark job development is Scala + SBT and an Uber Jar, with a yml properties file to pass configuration parameters. But if I would like to use dependency injection and microservice development, like the Spring Boot feature, in Scala, then what would be the stan

Best Practice for Spark Job Jar Generation

2016-12-22 Thread Chetan Khatri
h for Uber Less Jar. Guys, can you please explain the industry-standard best practice for the same? Thanks, Chetan Khatri.
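
A hedged build.sbt sketch of the "uber-less" direction discussed in the replies: mark Spark itself "provided" so the assembly stays small, and ship the remaining third-party jars once per cluster via spark-submit --jars or a shared HDFS path (sbt-assembly plugin assumed; versions illustrative):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.2" % "provided",
  "org.apache.spark" %% "spark-sql" % "2.0.2" % "provided"
)

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop conflicting manifests
  case _ => MergeStrategy.first
}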

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <chetan.opensou...@gmail.com> wrote: Hello Guys, I would like to understand the different approaches for distributed incremental load from HBase. Is there any tool / incubator tool which

Approach: Incremental data load from HBASE

2016-12-21 Thread Chetan Khatri
batch where flag is 0 or 1. I am looking for a best-practice approach with any distributed tool. Thanks. - Chetan Khatri
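
A hedged sketch of the flag-based incremental scan in plain HBase 1.x API, following Ted's suggestion earlier in the thread, so only not-yet-processed rows leave the region servers; the column family and qualifier names are hypothetical:

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

// Server-side filter: only rows whose cf:flag column equals "0" are returned.
val scan = new Scan()
scan.setFilter(new SingleColumnValueFilter(
  Bytes.toBytes("cf"),    // hypothetical column family
  Bytes.toBytes("flag"),  // hypothetical qualifier holding 0/1
  CompareFilter.CompareOp.EQUAL,
  Bytes.toBytes("0")))
// Hand the scan to TableInputFormat / newAPIHadoopRDD or your HBase-Spark integration of choice.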