> (i.e. this list) is for discussions about the development of
> Spark itself.
>
> On Wed, May 15, 2019 at 1:50 PM Chetan Khatri
> wrote:
>
Can anyone help me? I am confused. :(
On Wed, May 15, 2019 at 7:28 PM Chetan Khatri
wrote:
Hello Spark Developers,
I have a question about a Spark join I am doing.
I have full-load data from an RDBMS, stored on HDFS, let's say:
val historyDF = spark.read.parquet("/home/test/transaction-line-item")
and I am getting changed data at a separate HDFS path, let's say:
val deltaDF = spark.read
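A common sketch for applying such a delta on top of the history snapshot is a left-anti join on the key followed by a union. This is only an illustration: the key column transaction_line_item_id, the delta path, and the output path are assumptions, not details from the original post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("delta-merge").getOrCreate()

// Full history snapshot and the changed ("delta") rows.
val historyDF = spark.read.parquet("/home/test/transaction-line-item")
val deltaDF   = spark.read.parquet("/home/test/transaction-line-item-delta") // assumed path

// Keep history rows whose key does NOT appear in the delta, then append the
// delta rows, so changed keys take the delta version.
val mergedDF = historyDF
  .join(deltaDF.select("transaction_line_item_id"),   // assumed key column
        Seq("transaction_line_item_id"), "left_anti")
  .unionByName(deltaDF)

// Write to a fresh location and swap paths afterwards; overwriting the
// input path you are still reading from is unsafe.
mergedDF.write.parquet("/home/test/transaction-line-item-merged")
```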
Any thoughts, please?
On Fri, May 10, 2019 at 2:22 AM Chetan Khatri
wrote:
> Hello All,
>
> I need your help / suggestions,
>
> I am using Spark 2.3.1 with the HDP 2.6.1 distribution. I will describe my
> use case so you can see where people are trying to use Delta.
> My use case
Hello Dev Users,
I am struggling to parallelize a JDBC read in Spark. It is using only 1-2
tasks to read the data, and the read takes a long time.
Ex.
val invoiceLineItemDF = ((spark.read.jdbc(url = t360jdbcURL,
table = invoiceLineItemQuery,
columnName = "INVOICE_LINE_ITEM_ID",
lowerBound =
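For reference, the full partitioned-read call looks like the sketch below: the read is split into numPartitions tasks over ranges of the partition column between lowerBound and upperBound, and without these arguments Spark uses a single task. The URL, query, bounds, credentials, and partition count here are placeholder assumptions.

```scala
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("jdbc-parallel-read").getOrCreate()

val t360jdbcURL = "jdbc:sqlserver://dbhost:1433;databaseName=t360"   // placeholder URL
val invoiceLineItemQuery = "(SELECT * FROM INVOICE_LINE_ITEM) tmp"   // subquery needs an alias

val props = new Properties()
props.setProperty("user", "dbuser")          // placeholder credentials
props.setProperty("password", "dbpassword")

val invoiceLineItemDF = spark.read.jdbc(
  url = t360jdbcURL,
  table = invoiceLineItemQuery,
  columnName = "INVOICE_LINE_ITEM_ID",   // numeric column to partition on
  lowerBound = 1L,                       // roughly MIN(INVOICE_LINE_ITEM_ID)
  upperBound = 100000000L,               // roughly MAX(INVOICE_LINE_ITEM_ID)
  numPartitions = 32,                    // number of parallel read tasks
  connectionProperties = props)
```

If the bounds are far from the column's real min/max, most partitions come back empty and one or two tasks still do all the work.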
Sean, thank you.
Do you think tempDF.orderBy($"invoice_id".desc).limit(100)
would give the same result? I think so.
Thanks
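For what it's worth, the two suggestions should coincide; a sketch, assuming tempDF has an invoice_id column:

```scala
import org.apache.spark.sql.functions.col

// DataFrame style: Catalyst typically plans orderBy(...).limit(n) as a
// TakeOrderedAndProject, so the whole dataset is not fully sorted.
val top100DF = tempDF.orderBy(col("invoice_id").desc).limit(100)

// "Sort and take head(n)" style: same ordering, but collected to the
// driver as an Array[Row] instead of returned as a DataFrame.
val top100Rows = tempDF.orderBy(col("invoice_id").desc).head(100)
```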
On Wed, Sep 5, 2018 at 12:58 AM Sean Owen wrote:
> Sort and take head(n)?
>
> On Tue, Sep 4, 2018 at 12:07 PM Chetan Khatri
> wrote:
>
> I think doing an orderBy and limit would be equivalent after
> optimizations.
Dear Spark dev, is there anything equivalent in Spark?
val textMappedRDD = textlogRDD.flatMap { x => x.split("[^A-Za-z']+") }
  .map { y => y.replaceAll("""\n""", " ") }
textMappedRDD.collect()
3.
val tempRDD =
sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*",
200).flatMap(files => file
Hello Spark Dev Community,
A friend of mine is facing an issue while reading 20 GB of log files from a
directory on the cluster.
The approaches are as below:
1. This gives an out-of-memory error.
val logRDD =
sc.wholeTextFiles("file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*")
val mappedRDD = logRDD.flatMa
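One hedged alternative for approach 1: sc.wholeTextFiles materializes each file as a single (path, content) string, so a few large log files can blow up one executor, while sc.textFile reads line by line across many partitions. A sketch:

```scala
// Read lines, not whole files; 200 is a minimum partition count.
val logLines = sc.textFile(
  "file:/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/*", 200)

val words = logLines
  .flatMap(_.split("[^A-Za-z']+"))   // same tokenization as in the post
  .filter(_.nonEmpty)

// Avoid collect() on 20 GB of data; aggregate or sample instead.
val topWords = words.map((_, 1L)).reduceByKey(_ + _).take(20)
```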
n.html
>
> We will continue adding more there.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Mon, Jul 9, 2018 at 9:56 PM, Chetan Khatri wrote:
>
>> Hello Jayant,
>>
>> Thank you so much for the suggestion.
Pandas Dataframe for processing and finally write the
> results back.
>
> In the Spark/Scala/Java code, you get an RDD of string, which we convert
> back to a Dataframe.
>
> Feel free to ping me directly in case of questions.
>
> Thanks,
> Jayant
>
>
> On Thu, Jul 5
Prem, sure. Thanks for the suggestion.
On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote:
> try .pipe(.py) on RDD
>
> Thanks,
> Prem
>
> On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri wrote:
Can someone please advise? Thanks.
On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri,
wrote:
Hello dear Spark users / devs,
I would like to pass a Python user-defined function to a Spark job developed
in Scala, and have the return value of that function come back to the DF /
Dataset API.
Can someone please guide me on the best approach to do this?
The Python function would be mostly a transfor
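One option suggested in this thread is RDD.pipe(): stream each partition's records through an external Python script over stdin/stdout. A minimal sketch; transform.py, the CSV record format, and the output column are all hypothetical.

```scala
import spark.implicits._

val inputDF = spark.range(0, 10).toDF("id")

// transform.py is a hypothetical script: it reads one line per record
// from stdin and writes one transformed line per record to stdout.
val pipedRDD = inputDF.rdd
  .map(_.mkString(","))            // serialize each Row to a text line
  .pipe("python transform.py")     // run the Python function per partition

// Parse the script's output back into a DataFrame.
val resultDF = pipedRDD.toDF("value")
```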
Can anybody reply to this?
On Tue, Nov 21, 2017 at 3:36 PM, Chetan Khatri
wrote:
>
> Hello Spark Users,
>
> I am getting the error below when I am trying to write a dataset to a
> Parquet location. I have enough disk space available. Last time I was
> facing the same kind of error, whic
Hello All,
I have a Spark DataFrame with timestamps from 2015-10-07 19:36:59 to
2017-01-01 18:53:23.
I want to split this DataFrame into 3 parts, and I wrote the code below to
split it. Can anyone please confirm whether this is the correct approach?
val finalDF1 = sampleDF.where(sampleDF.col("timestamp_col").
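Where the code is cut off above, the usual shape of such a split is three disjoint where() filters on the timestamp column; the two boundary dates below are illustrative assumptions, not the author's values.

```scala
import org.apache.spark.sql.functions.col

val finalDF1 = sampleDF.where(col("timestamp_col") <  "2016-01-01 00:00:00")
val finalDF2 = sampleDF.where(col("timestamp_col") >= "2016-01-01 00:00:00" &&
                              col("timestamp_col") <  "2016-07-01 00:00:00")
val finalDF3 = sampleDF.where(col("timestamp_col") >= "2016-07-01 00:00:00")

// Sanity check that the parts are disjoint and complete:
// finalDF1.count + finalDF2.count + finalDF3.count == sampleDF.count
```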
Is this just a one time thing or something regular?
> If it is a one time thing then I would tend more towards putting each
> table in HDFS (parquet or ORC) and then join them.
> What is the Hive and Spark version?
>
> Best regards
>
> > On 2. Nov 2017, at 20:57, Chetan Khatr
Hello Spark Developers,
I have 3 tables that I am reading from HBase, and I want to do a join
transformation and save the result to a Hive Parquet external table.
Currently my join is failing with a container-failed error.
1. Read table A from HBase with ~17 billion records.
2. Repartition on the primary key of table A
Hey Spark Dev,
Can anyone suggest sample Spark Streaming / Spark SQL job logs to
download? I want to play with log analytics.
Thanks
> Mostly, people find this number for their job "experimentally" (e.g. they
> try a few different things).
>
> On Wed, Aug 2, 2017 at 1:52 PM, Chetan Khatri wrote:
>
>> Ryan,
>> Thank you for reply.
>>
>> For 2 TB of Data what should be the value of
ill be used for Spark execution, not reserved whatever is
> consuming it and causing the OOM. (If Spark's memory is too low, you'll see
> other problems like spilling too much to disk.)
>
> rb
>
> On Wed, Aug 2, 2017 at 9:02 AM, Chetan Khatri wrote:
>
>
Can anyone please guide me with the above issue?
On Wed, Aug 2, 2017 at 6:28 PM, Chetan Khatri
wrote:
> Hello Spark Users,
>
> I am reading an HBase table and writing to a Hive managed table, where I
> applied partitioning by a date column. That worked fine, but it has
> generated more num
I think it will be the same, but let me try that.
FYR - https://issues.apache.org/jira/browse/SPARK-19881
On Fri, Jul 28, 2017 at 4:44 PM, ayan guha wrote:
> Try running spark.sql("set yourconf=val")
>
> On Fri, 28 Jul 2017 at 8:51 pm, Chetan Khatri
> wrote:
>
>> Jo
Jörn, both are the same.
On Fri, Jul 28, 2017 at 4:18 PM, Jörn Franke wrote:
> Try sparksession.conf().set
>
> On 28. Jul 2017, at 12:19, Chetan Khatri
> wrote:
>
Hey Dev / User,
I am working with Spark 2.0.1 with dynamic partitioning with Hive, and I am
facing the issue below:
org.apache.hadoop.hive.ql.metadata.HiveException:
Number of dynamic partitions created is 1344, which is more than 1000.
To solve this try to set hive.exec.max.dynamic.partitions to at least 1
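The usual workaround is to raise the Hive limits before the insert; note, though, that later in this thread it is reported that on Spark 2.0.1 these SET commands may not take effect (SPARK-19881). A sketch, with 2000 as an illustrative value above the 1344 partitions reported:

```scala
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=2000")
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=2000")
```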
Hello Spark Devs,
Can you please guide me on how to flatten JSON into multiple columns in Spark?
Example:
Sr No | Title           | ISBN       | Info
1     | Calculus Theory | 1234567890 | [{"cert":[{
"authSbmtr":"009415da-c8cd-418d-869e-0a19601d79fa",
"certUUID":"03ea5a1a-5530-4fa3-8871-9d1
ot in omitted form, like:
>
> {
> "first_name": "Dongjin"
> }
>
> right?
>
> - Dongjin
>
> On Wed, Mar 8, 2017 at 5:58 AM, Chetan Khatri wrote:
Hello Dev / Users,
I am working on migrating PySpark code to Scala. With Python, iterating
over a dictionary and generating JSON with nulls is possible with
json.dumps(), which is then converted to a SparkSQL Row; but in Scala, how
can we generate JSON with null values as a DataFrame?
Thanks.
> github.com/SparkMonitor/varOne https://github.com/groupon/sparklint
>
> Chetan Khatri schrieb am Do., 16. Feb. 2017
> um 06:15 Uhr:
Hello All,
What would be the best approaches to monitor Spark performance? Are there
any tools for Spark job performance monitoring?
Thanks.
d, Feb 15, 2017, 06:44 Chetan Khatri
> wrote:
Hello Spark Dev Team,
My team and I were quite confused about why your public documentation is
not updated with SparkSession, if SparkSession is the ongoing extension and
best practice instead of creating a SparkContext.
Thanks.
> since.
>
> Jacek
>
>
> On 29 Jan 2017 9:24 a.m., "Chetan Khatri"
> wrote:
>
> Hello Spark Users,
>
> I am getting an error while saving a Spark DataFrame to a Hive table:
> Hive 1.2.1
> Spark 2.0.0
> Local environment.
> Note: Job is getting execut
TotalOrderPartitioner
(sorts data, producing a large number of region files)
Import HFiles into HBase
HBase can merge files if necessary
On Sat, Jan 28, 2017 at 11:32 AM, Chetan Khatri wrote:
> @Ted, I don't think so.
>
> On Thu, Jan 26, 2017 at 6:35 AM, Ted Yu wrote:
>
>> Does t
use Hive EXTERNAL TABLE
> with
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.
>
>
> Try this and see if your problem can be solved:
>
>
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
>
>
> Regards
>
> Amrit
>
>
>
Yu wrote:
> Though no hbase release has the hbase-spark module, you can find the
> backport patch on HBASE-14160 (for Spark 1.6)
>
> You can build the hbase-spark module yourself.
>
> Cheers
>
> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
> chetan.opensou...@gmai
Hello Spark Community Folks,
Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking to bulk
load from HBase to Hive.
I have seen a couple of good examples in the HBase GitHub repo:
https://github.com/apache/hbase/tree/master/hbase-spark
If I would like to use HBaseContext with HBase 1.2.4, how
/hive-site.xml /usr/local/spark/conf
>
> If you want to use the existing Hive metastore, you need to provide that
> information to Spark.
>
> Bests,
> Dongjoon.
>
> On 2017-01-16 21:36 (-0800), Chetan Khatri
> wrote:
> > Hello,
> >
> > I have following
Hello,
I have following services are configured and installed successfully:
Hadoop 2.7.x
Spark 2.0.x
HBase 1.2.4
Hive 1.2.1
*Installation Directories:*
/usr/local/hadoop
/usr/local/spark
/usr/local/hbase
*Hive Environment variables:*
#HIVE VARIABLES START
export HIVE_HOME=/usr/local/hive
expo
chema.struct);
stdDf: org.apache.spark.sql.DataFrame = [stid: string, name: string ... 3
more fields]
Thanks.
On Tue, Jan 17, 2017 at 12:48 AM, Chetan Khatri wrote:
> Hello Community,
>
> I am struggling to save Dataframe to Hive Table,
>
> Versions:
>
> Hive 1.2.
Hello Community,
I am struggling to save a DataFrame to a Hive table.
Versions:
Hive 1.2.1
Spark 2.0.1
*Working code:*
/*
@Author: Chetan Khatri
/* @Author: Chetan Khatri Description: This Scala script has written for
HBase to Hive module, which reads table from HBase and dump it out to Hive
h.
>
> I would check the RegionServer logs -- I'm guessing that it never started
> correctly or failed. The error message is saying that certain regions in
> the system were never assigned to a RegionServer which only happens in
> exceptional cases.
>
> Chetan Khatri wrote
Ayan, thanks.
Correct, I am not thinking in RDBMS terms; I am wearing NoSQL glasses!
On Fri, Jan 6, 2017 at 3:23 PM, ayan guha wrote:
> IMHO you should not "think" HBase in RDMBS terms, but you can use
> ColumnFilters to filter out new records
>
> On Fri, Jan 6, 2017 at
t at Row level.
>
> On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Ted Yu,
>>
>> You understood wrong; I said incremental load from HBase to Hive.
>> Individually, you can say incremental import f
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
> data into hbase.
>
> For your use case, the producer needs to find rows where the flag is 0 or
> 1.
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, De
tlS, https://freebusy.io/la...@mapflat.com
>
>
> On Fri, Dec 23, 2016 at 11:56 AM, Chetan Khatri
> wrote:
> > Hello Community,
> >
> > Current approach I am using for Spark Job Development with Scala + SBT
> and
> > Uber Jar with yml properties file to pass config
nd we've found (from having different
> versions as well) that older versions are mostly compatible. Some things
> fail occasionally, but we haven't had too many problems running different
> versions with the same metastore in practice.
>
> rb
>
> On Wed, Dec 28
, unable to check from the error what exactly it is.
Thanks.
On Wed, Dec 28, 2016 at 9:00 PM, Chetan Khatri
wrote:
Hello Spark Community,
I am reading an HBase table from Spark and getting an RDD, but now I want
to convert the RDD of Spark Rows to a DataFrame.
*Source Code:*
bin/spark-shell --packages
it.nerdammer.bigdata:spark-hbase-connector_2.10:1.0.3 --conf
spark.hbase.host=127.0.0.1
import it.nerdamme
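Where the snippet cuts off, the standard way to finish this is spark.createDataFrame over the RDD[Row] with an explicit schema. The column names and types below are assumptions for illustration, as is the hbaseRDD value produced by the connector read:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumed shape of the rows coming back from HBase.
val schema = StructType(Seq(
  StructField("rowkey", StringType, nullable = false),
  StructField("value",  StringType, nullable = true)))

// hbaseRDD: RDD[Row] obtained from the spark-hbase-connector read.
val hbaseDF = spark.createDataFrame(hbaseRDD, schema)
hbaseDF.printSchema()
```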
Hello Users / Developers,
I am using Hive 2.0.1 with MySQL as a metastore; can you tell me which
version is most compatible with Spark 2.0.2?
Thanks
Could you share pseudocode for the same?
Cheers!
C Khatri.
On Fri, Dec 23, 2016 at 4:33 PM, Andy Dang wrote:
> Hi all,
>
> Today I hit a weird bug in Spark 2.0.2 (vanilla Spark) - the executor tab
> shows negative number of active tasks.
>
> I have about 25 jobs, each with 20k tasks so the nu
> After such rows are obtained, it is up to you how the result of processing
> is delivered to hbase.
>
> Cheers
>
> On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> OK, sure, I will ask.
>>
>> But what would be
dy
>
> On Fri, Dec 23, 2016 at 6:00 PM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Andy, thanks for the reply.
>>
>> If we download all the dependencies to a separate location and link them
>> with the Spark job jar on the Spark cluster, is that the best way to execute
us).
>
> ---
> Regards,
> Andy
>
> On Fri, Dec 23, 2016 at 6:44 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Spark Community,
>>
>> For Spark Job Creation I use SBT Assembly to build Uber("Super") Jar and
>>
Hello Community,
The current approach I am using for Spark job development is Scala + SBT and
an uber jar, with a YAML properties file to pass configuration parameters.
But if I would like to use dependency injection and microservice development,
like Spring Boot features, in Scala, then what would be the stan
h for an uber-less jar. Guys, can you please
explain the industry-standard best practice for the same?
Thanks,
Chetan Khatri.
>
>
> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
> chetan.opensou...@gmail.com> wrote:
>
>> Hello Guys,
>>
>> I would like to understand different approaches for distributed incremental
>> load from HBase. Is there any tool / incubator tool which
batch where the flag is 0 or 1.
I am looking for a best-practice approach with any distributed tool.
Thanks.
- Chetan Khatri