Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-24 Thread Jianneng Li
Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so UDF likely won't work. Jianneng From: Liu Genie Sent: Monday, February 24, 2020 6:39 PM To: user@spark.apache.org Subject: Re: [Spark SQL] Memory problems with packing too many joins

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-24 Thread Liu Genie
WholeStageCodegen generates code that appends results<https://github.com/apache/spark/blob/v3.0.0-preview2/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L771> into a BufferedRowIterator, which keeps the results in an in-memory linked list<https://github.com/ap

[Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-24 Thread Jianneng Li
Hello everyone, WholeStageCodegen generates code that appends results<https://github.com/apache/spark/blob/v3.0.0-preview2/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L771> into a BufferedRowIterator, which keeps the results in an in-memory linked
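For readers hitting the same thing, a minimal sketch (not from the thread) of how to inspect the generated code and, as a blunt workaround, fall back to non-codegen execution; the tables and the join are illustrative, assuming Spark 2.4+:

  // Inspect per-stage generated code and optionally disable whole-stage codegen.
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.execution.debug._   // adds debugCodegen() to Dataset

  val spark = SparkSession.builder().appName("codegen-inspect").getOrCreate()
  import spark.implicits._

  val a = spark.range(0, 1000000).toDF("id")
  val b = spark.range(0, 1000000).toDF("id")
  val joined = a.join(b, "id").join(b.withColumnRenamed("id", "id2"), $"id" === $"id2")

  joined.debugCodegen()                                    // dump the code of each WholeStageCodegen stage
  spark.conf.set("spark.sql.codegen.wholeStage", "false")  // workaround: interpreted execution for this session
  joined.count()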

Re: [Spark SQL] NegativeArraySizeException When Parse InternalRow to DTO Field with Type Array[String]

2020-02-23 Thread Sandeep Patra
I encounter the below NegativeArraySizeException when running Spark SQL. The > catalyst generated code for "apply2_19" and "apply1_11" is attached and > also the related DTO. > It is difficult to understand how the problem could happen; please help if you have any > idea. > > I can see mayb

[Spark SQL] NegativeArraySizeException When Parse InternalRow to DTO Field with Type Array[String]

2020-02-23 Thread Proust (Feng Guizhou) [Travel Search & Discovery]
Hi Spark users, I encounter the below NegativeArraySizeException when running Spark SQL. The catalyst generated code for "apply2_19" and "apply1_11" is attached, and also the related DTO. It is difficult to understand how the problem could happen; please help if you have any idea. I

Integration testing Framework Spark SQL Scala

2020-02-20 Thread Ruijing Li
Hi all, I’m interested in hearing the community’s thoughts on best practices to do integration testing for spark sql jobs. We run a lot of our jobs with cloud infrastructure and hdfs - this makes debugging a challenge for us, especially with problems that don’t occur from just initializing
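One common baseline (a sketch, assuming ScalaTest 3.1+ and a local SparkSession; all names are illustrative) is to run the same SQL against small in-memory fixtures in a unit-style test, keeping HDFS/cloud-specific debugging for a later stage:

  import org.apache.spark.sql.SparkSession
  import org.scalatest.funsuite.AnyFunSuite

  class EtlJobSpec extends AnyFunSuite {
    private lazy val spark = SparkSession.builder()
      .master("local[2]")
      .appName("etl-integration-test")
      .getOrCreate()

    test("daily aggregation produces one row per user") {
      import spark.implicits._
      // small in-memory fixture standing in for the real HDFS/cloud input
      Seq(("u1", 10), ("u1", 5), ("u2", 7)).toDF("user_id", "amount")
        .createOrReplaceTempView("events")

      val result = spark.sql(
        "SELECT user_id, SUM(amount) AS total FROM events GROUP BY user_id")

      assert(result.count() == 2)
    }
  }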

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Sengupta On Wed, Jan 15, 2020 at 9:10 PM Kalin Stoyanov wrote: > Hi all, > > @Enrico, I've added just the SQL query pages (+js dependencies etc.) in > the google drive - > https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing > That is what you

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, @Enrico, I've added just the SQL query pages (+js dependencies etc.) in the google drive - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing That is what you had in mind right? They are different indeed. (For some reason after I saved them off

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I do not expect your queries to hit such a major performance regression in a new release. Also, please try the 3.0 preview releases. Thanks, Xiao On Wed, Jan 15, 2020 at 10:53 AM, Kalin Stoyanov wrote: > Hi Xiao, > >

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Gourav Sengupta
Hi, I am pretty sure that AWS released EMR 5.28.1 with some bug fixes the day before yesterday. Also please ensure that you are using s3:// instead of s3a:// or anything like that. On another note, Xiao is not entirely right in suggesting that EMR issues should not be posted here; a large group of

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi Xiao, Thanks, I didn't know that. This https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/ implies that their fork is not used in emr 5.27. I tried that and it has the same issue. But then again in their article they were comparing emr 5.27 vs 5.16 so I

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Xiao Li
EMR has its own fork of Spark, called the EMR runtime. It is not Apache Spark. You might need to talk with them instead of posting questions in the Apache Spark community. Cheers, Xiao On Wed, Jan 15, 2020 at 9:53 AM, Kalin Stoyanov wrote: > Hi all, > > First of all let me say that I am pretty new to

Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

2020-01-15 Thread Kalin Stoyanov
Hi all, First of all let me say that I am pretty new to Spark so this could be entirely my fault somehow... I noticed this when I was running a job on an amazon emr cluster with Spark 2.4.4, and it got done slower than when I had ran it locally (on Spark 2.4.1). I checked out the event logs, and

Performance advantage of Spark SQL versus CSL API

2019-12-24 Thread Rajev Agarwal
Hello, I am wondering whether there is a clear-cut performance advantage to using the CSL API instead of Spark SQL for queries in Java? I am interested in Joins, Aggregates, and Group By (with several fields) clauses. Thank you. RajevA
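For what it's worth, SQL strings and the DataFrame/Dataset API compile to the same Catalyst plans, so the usual way to settle this for a specific query is to compare the plans; a sketch with illustrative table names (shown in Scala, the same applies from Java):

  // assumes an existing SparkSession `spark` with "orders" and "customers" registered
  import spark.implicits._

  val bySql = spark.sql(
    "SELECT o.customer_id, COUNT(*) AS cnt " +
    "FROM orders o JOIN customers c ON o.customer_id = c.id " +
    "GROUP BY o.customer_id")

  val byApi = spark.table("orders").as("o")
    .join(spark.table("customers").as("c"), $"o.customer_id" === $"c.id")
    .groupBy($"o.customer_id")
    .count()

  bySql.explain(true)   // the physical plans should be essentially identical
  byApi.explain(true)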

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Enrico Minack
Is there a chance your data is all even or all odd? On Tue, Dec 17, 2019 at 11:01 AM Tzahi File <tzahi.f...@ironsrc.com> wrote: I have in my spark sql query a calculated field that gets the value of field1 % 3. I'm using this field as a partition so I expected to get 3

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
no.. there're 100M records both even and odd On Tue, Dec 17, 2019 at 8:13 PM Russell Spitzer wrote: > Is there a chance your data is all even or all odd? > > On Tue, Dec 17, 2019 at 11:01 AM Tzahi File > wrote: > >> I have in my spark sql query a calculated fiel

Re: Issue With mod function in Spark SQL

2019-12-17 Thread Russell Spitzer
Is there a chance your data is all even or all odd? On Tue, Dec 17, 2019 at 11:01 AM Tzahi File wrote: > I have in my spark sql query a calculated field that gets the value of > field1 % 3. > > I'm using this field as a partition so I expected to get 3 partitions in > the mentio

Issue With mod function in Spark SQL

2019-12-17 Thread Tzahi File
I have in my spark sql query a calculated field that gets the value of field1 % 3. I'm using this field as a partition, so I expected to get 3 partitions in the mentioned case, and I do. The issue happens with even divisors (4, 2, ... instead of 3). When I tried to use even numbers
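A quick way to check the distribution question raised above (a sketch, assuming a DataFrame df with an integer column field1):

  import org.apache.spark.sql.functions.col

  // which remainder values actually occur, and how often?
  df.select((col("field1") % 3).alias("part")).groupBy("part").count().show()
  df.select((col("field1") % 2).alias("part")).groupBy("part").count().show()
  // if the data were all even (or all odd), field1 % 2 would only ever produce one value,
  // so an even divisor could yield fewer partitions than expected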

Re: [Spark SQL]: Does namespace name is always needed in a query for tables from a user defined catalog plugin

2019-12-01 Thread xufei
Thanks, Terry. Glad to know that it is not expected behavior. Terry Kim wrote on Mon, Dec 2, 2019 at 11:51 AM: > Hi Xufei, > I also noticed the same while looking into relation resolution behavior > (See Appendix A in this doc >

Re: [Spark SQL]: Does namespace name is always needed in a query for tables from a user defined catalog plugin

2019-12-01 Thread Terry Kim
Hi Xufei, I also noticed the same while looking into relation resolution behavior (See Appendix A in this doc ). I created SPARK-30094 and will

[Spark SQL]: Does namespace name is always needed in a query for tables from a user defined catalog plugin

2019-12-01 Thread xufei
Hi, I'm trying to write a catalog plugin based on spark-3.0-preview, and I found that even when I use 'use catalog.namespace' to set the current catalog and namespace, I still need to use the qualified name in the query. For example, I add a catalog named 'example_catalog', there is a database named 'test'

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-12 Thread Bryan Cutler
Thanks all. I created a WIP PR at https://github.com/apache/spark/pull/26496, we can further discuss the details in there. On Thu, Nov 7, 2019 at 7:01 PM Takuya UESHIN wrote: > +1 > > On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp wrote: > >> +1 >> >> On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon

Temporary tables for Spark SQL

2019-11-12 Thread Laurent Bastien Corbeil
Hello, I am new to Spark, so I have a basic question which I couldn't find an answer to online. If I want to run SQL queries on a Spark dataframe, do I have to create a temporary table first? I know I could use the Spark SQL API, but is there a way of simply reading the data and running SQL queries
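For reference, a minimal sketch of the usual pattern: a temporary view is just a name registered in the session catalog (no data is copied), after which plain SQL works against it; the path and names are illustrative:

  val df = spark.read.parquet("/data/events.parquet")   // assumes an existing SparkSession `spark`
  df.createOrReplaceTempView("events")                  // registers a name, does not copy the data

  spark.sql("SELECT event_type, COUNT(*) AS cnt FROM events GROUP BY event_type").show()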

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
t;>> On Mon, Nov 11, 2019 at 9:46 AM Tzahi File >>> wrote: >>> >>>> Hi, >>>> >>>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a >>>> percentile function. I'm trying to improve this job by movin

Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
huge cluster(m5.24xl * 40 workers) to run a >>> percentile function. I'm trying to improve this job by moving it to run >>> with spark SQL. >>> >>> Any suggestions on how to use a percentile function in Spark? >>> >>> >>> Thanks, >>> -

Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
> Hi, >>> >>> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a >>> percentile function. I'm trying to improve this job by moving it to run >>> with spark SQL. >>> >>> Any suggestions on how to use a percentile function in Spa

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
>> >> Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a >> percentile function. I'm trying to improve this job by moving it to run >> with spark SQL. >> >> Any suggestions on how to use a percentile function in Spark? >> >>

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
for this task? Because I bet that's what's slowing you down. On Mon, Nov 11, 2019 at 9:46 AM Tzahi File wrote: > Hi, > > Currently, I'm using hive huge cluster(m5.24xl * 40 workers) to run a > percentile function. I'm trying to improve this job by moving it to run > with spa

Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi, Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with Spark SQL. Any suggestions on how to use a percentile function in Spark? Thanks, -- Tzahi File Data Engineer [image: ironSource] <h
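For reference, Spark SQL ships percentile_approx (and an exact percentile aggregate) out of the box, so the Hive job can often be ported directly; a sketch with illustrative table and column names:

  // assumes an existing SparkSession `spark` and a registered table "requests"
  spark.sql(
    "SELECT percentile_approx(response_time, 0.95) AS p95 FROM requests").show()

  // or through the DataFrame API
  import org.apache.spark.sql.functions.expr
  spark.table("requests")
    .agg(expr("percentile_approx(response_time, 0.95)").as("p95"))
    .show()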

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Takuya UESHIN
+1 On Thu, Nov 7, 2019 at 6:54 PM Shane Knapp wrote: > +1 > > On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon wrote: > > > > +1 > > > > On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote: > >> > >> Sounds reasonable to me. We should make the behavior consistent within > Spark. > >> > >> On Tue, Nov 5,

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Shane Knapp
+1 On Thu, Nov 7, 2019 at 6:08 PM Hyukjin Kwon wrote: > > +1 > > On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote: >> >> Sounds reasonable to me. We should make the behavior consistent within Spark. >> >> On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: >>> >>> Currently, when a PySpark Row is

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-07 Thread Hyukjin Kwon
+1 On Wed, Nov 6, 2019 at 11:38 PM, Wenchen Fan wrote: > Sounds reasonable to me. We should make the behavior consistent within > Spark. > > On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > >> Currently, when a PySpark Row is created with keyword arguments, the >> fields are sorted

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

[DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-04 Thread Bryan Cutler
Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has created a lot of confusion with users because it is not obvious (although it is stated in the pydocs) that they will be sorted alphabetically. Then later when applying a schema and the

[Spark SQL]: Dataframe group by potential bug (Scala)

2019-10-31 Thread ludwiggj
This is using Spark Scala 2.4.4. I'm getting some very strange behaviour after reading in a dataframe from a json file, using sparkSession.read in permissive mode. I've included the error column when reading in the data, as I want to log details of any errors in the input json file. My suspicion
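For readers debugging similar reads, a sketch of the permissive-read-with-error-column setup being described, with illustrative schema, path and column name (the corrupt-record column must be declared in the schema, and caching before filtering on it avoids an analysis restriction in Spark 2.3+):

  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.types._

  val schema = new StructType()
    .add("id", LongType)
    .add("name", StringType)
    .add("_corrupt_record", StringType)   // receives the raw line when parsing fails

  val df = spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .schema(schema)
    .json("/data/input.json")
    .cache()

  df.filter(col("_corrupt_record").isNotNull).show(false)   // inspect or log the bad records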

[Spark Sql] Direct write on hive and s3 while executing a CTAS on spark sql

2019-10-24 Thread francexo83
Hi all, I'm using Spark 2.4.0, my spark.sql.catalogImplementation is set to hive, while spark.sql.warehouse.dir is set to a specific S3 bucket. I want to execute a CTAS statement in Spark SQL like the one below. *create table db_name.table_name as (select ..)* When writing, spark always uses

convert josn string in spark sql

2019-10-16 Thread amit kumar singh
Hi Team, I have Kafka messages where the JSON comes in as a string. How can I create a table after converting the JSON string into parsed JSON using Spark SQL?
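A minimal sketch of one way to do this (the schema and names are illustrative; `raw` is assumed to be a DataFrame with a string `value` column, as produced by the Kafka source):

  import org.apache.spark.sql.functions.{col, from_json}
  import org.apache.spark.sql.types._

  val schema = new StructType()
    .add("id", LongType)
    .add("name", StringType)
    .add("ts", TimestampType)

  val parsed = raw
    .select(from_json(col("value").cast("string"), schema).as("data"))
    .select("data.*")

  parsed.createOrReplaceTempView("events")
  spark.sql("SELECT id, name FROM events").show()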

Re: Use our own metastore with Spark SQL

2019-10-14 Thread Zhu, Luke
everything Hadoop, you can also implement ExternalCatalog: https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala See https://jira.apache.org/jira/browse/SPARK-23443 for ongoing progress

Use our own metastore with Spark SQL

2019-10-14 Thread xweb
Is it possible to use our own metastore instead of Hive Metastore with Spark SQL? Can you please point me to some docs or code I can look at to get it done? We are moving away from everything Hadoop. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com

question about spark sql join pruning

2019-10-08 Thread Shuo Chen
Hi! I have a question about spark left join pruning. For example, I have 2 tables: table A: create table A ( user_id int, gender string, email string, phone string, ) table B: create table B ( user_id int, jobs string, graduate_schools string ) If I select columns of A from

How to handle this use-case in spark-sql-streaming

2019-09-30 Thread Shyam P
Hi, I have scenario like below https://stackoverflow.com/questions/58134379/how-to-handle-backup-scenario-in-spark-structured-streaming-using-joins How to handle this use-case ( back-up scenario) in spark-structured-streaming? Any clues would be highly appreciated. Thanks, Shyam

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-14 Thread Dhaval Patel
Hi Abhinesh, Since dropDuplicates keeps the first record, you can tag the rows of the 1st and 2nd df with an id and then union -> sort on that id -> drop duplicates. This should ensure that records from the 1st df are kept and those from the 2nd are dropped. Regards Dhaval On Sat, Sep 14, 2019 at 4:41 PM Abhinesh Hada wrote: > Hey
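A sketch of the same tag-and-prefer-the-first idea, using a window so the keep-first rule does not depend on row ordering after the union; column names are illustrative and both inputs are assumed to share a schema with an `id` column:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, lit, row_number}

  val tagged = df1.withColumn("src", lit(1)).union(df2.withColumn("src", lit(2)))

  val w = Window.partitionBy("id").orderBy(col("src"))
  val deduped = tagged
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") === 1)      // keeps the df1 row whenever both sides contain the id
    .drop("src", "rn")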

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-14 Thread Abhinesh Hada
Hey Nathan, As the dataset is very huge, I am looking for ways that involve minimum joins. I will give a try to your approach. Thanks a lot for your help. On Sat, Sep 14, 2019 at 12:58 AM Nathan Kronenfeld wrote: > It's a bit of a pain, but you could just use an outer join (assuming there >

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Nathan Kronenfeld
It's a bit of a pain, but you could just use an outer join (assuming there are no duplicates in the input datasets, of course): import org.apache.spark.sql.test.SharedSparkSession import org.scalatest.FunSpec class QuestionSpec extends FunSpec with SharedSparkSession { describe("spark list

Re: [Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Patrick McCarthy
If you only care that you're deduping on one of the fields, you could add an index and count like so: df3 = df1.withColumn('idx', lit(1)).union(df2.withColumn('idx', lit(2))) remove_df = df3.groupBy('id').agg(collect_set('idx').alias('set_size')).filter(size(col('set_size')) > 1).select('id',

[Spark SQL]: Does Union operation followed by drop duplicate follows "keep first"

2019-09-13 Thread Abhinesh Hada
Hi, I am trying to take union of 2 dataframes and then drop duplicate based on the value of a specific column. But, I want to make sure that while dropping duplicates, the rows from first data frame are kept. Example: df1 = df1.union(df2).dropDuplicates(['id'])

How to query StructField's metadata in spark sql?

2019-09-05 Thread kyunam
Using SQL, is it possible to query a column's metadata? Thanks, Kyunam -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
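As far as I know, plain SQL does not expose arbitrary StructField metadata (DESCRIBE only surfaces column comments); from the API it can be read off the schema. A sketch with illustrative names:

  val df = spark.table("my_table")
  val meta = df.schema("my_column").metadata   // org.apache.spark.sql.types.Metadata
  println(meta.json)                           // all metadata for the column, as JSON
  if (meta.contains("comment")) println(meta.getString("comment"))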

Re: Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Jörn Franke
1) this is not a use case, but a technical solution. Hence nobody can tell you if it make sense or not 2) do an upsert in Cassandra. However keep in mind that the application submitting to the Kafka topic and the one consuming from the Kafka topic need to ensure that they process messages in

Re: Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Aayush Ranaut
What exactly is your requirement?  Is the read before write mandatory? Are you maintaining states in Cassandra? Regards Prathmesh Ranaut https://linkedin.com/in/prathmeshranaut > On Aug 29, 2019, at 3:35 PM, Shyam P wrote: > > > thanks Aayush.     For every record I need to get the data

Re: Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Shyam P
thanks Aayush. For every record I need to get the data from the Cassandra table and update it? Else it may not update the existing record. What is this datastax-spark-connector? Is it not a "Cassandra connector library written for spark"? If not, how do we write one ourselves? Where and how to

Re: Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Aayush Ranaut
Cassandra is upsert, you should be able to do what you need with a single statement unless you’re looking to maintain counters. I’m not sure if there is a Cassandra connector library written for spark streaming because we wrote one ourselves when we wanted to do the same. Regards Prathmesh

Will this use-case can be handled with spark-sql streaming and cassandra?

2019-08-29 Thread Shyam P
Hi, I need to do a PoC for a business use-case. *Use case:* Need to update a record in a Cassandra table if it exists. Does Spark Streaming support comparing each record and updating the existing Cassandra record? For each record received from the Kafka topic, if I want to check and compare each record

Re: [Spark SQL] failure in query

2019-08-29 Thread Subash Prabakar
:56, Tzahi File wrote: > Hi, > > I encountered some issue to run a spark SQL query, and will happy to some > advice. > I'm trying to run a query on a very big data set (around 1.5TB) and it > getting failures in all of my tries. A template of the query is as below: > insert over

[Spark SQL] failure in query

2019-08-25 Thread Tzahi File
Hi, I encountered some issues running a Spark SQL query and would be happy to get some advice. I'm trying to run a query on a very big data set (around 1.5TB) and it keeps failing in all of my tries. A template of the query is as below: insert overwrite table partition(part) select /*+ BROADCAST(c

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-14 Thread Hao Ren
Then I read that in order > for spark to be aware of all the partitions it first read the folders and > then updated its metastore . Then the sql is applied on TOP of it. Instead > of using the existing hive SerDe and this property is only for parquet > files. > > Hive metasto

Re: Any advice how to do this usecase in spark sql ?

2019-08-13 Thread Jörn Franke
on your use case. > On 14.08.2019 at 05:08, Shyam P wrote: > > Hi, > Any advice how to do this in spark sql ? > > I have a scenario as below > > dataframe1 = loaded from an HDFS Parquet file. > > dataframe2 = read from a Kafka Stream. > > If column1

Any advice how to do this usecase in spark sql ?

2019-08-13 Thread Shyam P
Hi, Any advice on how to do this in Spark SQL? I have a scenario as below: dataframe1 = loaded from an HDFS Parquet file. dataframe2 = read from a Kafka Stream. If the column1 value of dataframe1 is in the columnX values of dataframe2, then I need to replace the column1 value of dataframe1
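In batch form, one way to express this is a left join against the distinct columnX values plus a conditional column; a sketch using the names from the question, with an illustrative replacement value:

  import org.apache.spark.sql.functions.{col, lit, when}

  val matches = dataframe2.select(col("columnX").as("column1")).distinct()

  val replaced = dataframe1
    .join(matches.withColumn("matched", lit(true)), Seq("column1"), "left")
    .withColumn("column1",
      when(col("matched"), lit("REPLACEMENT_VALUE")).otherwise(col("column1")))
    .drop("matched")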

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-12 Thread Subash Prabakar
and then updated its metastore . Then the sql is applied on TOP of it. Instead of using the existing hive SerDe and this property is only for parquet files. Hive metastore Parquet table conversion <https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-metastore-parquet-table-conversion>

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-09 Thread Hao Ren
of the one in question during the execution of the spark sql query. This is why this simple query takes too much time. I would like to know how to improve this by just reading the specific partition in question. Feel free to ask more questions if I am not clear. Best regards, Hao On Thu, Aug 8, 2019 at 9

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-08 Thread Mich Talebzadeh
-- Forwarded message - > From: Hao Ren > Date: Thu, Aug 8, 2019 at 4:15 PM > Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive > table > To: Gourav Sengupta > > > Hi Gourva, > > I am using enableHiveSupport. > The table was not created by Spa

Fwd: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-08 Thread Hao Ren
-- Forwarded message - From: Hao Ren Date: Thu, Aug 8, 2019 at 4:15 PM Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive table To: Gourav Sengupta Hi Gourva, I am using enableHiveSupport. The table was not created by Spark. The table already exists

Re: Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-08 Thread Gourav Sengupta
Hi, Just out of curiosity did you start the SPARK session using enableHiveSupport() ? Or are you creating the table using SPARK? Regards, Gourav On Wed, Aug 7, 2019 at 3:28 PM Hao Ren wrote: > Hi, > I am using Spark SQL 2.3.3 to read a hive table which is partitioned by >

Spark SQL reads all leaf directories on a partitioned Hive table

2019-08-07 Thread Hao Ren
Hi, I am using Spark SQL 2.3.3 to read a hive table which is partitioned by day, hour, platform, request_status and is_sampled. The underlying data is in parquet format on HDFS. Here is the SQL query to read just *one partition*. ``` spark.sql(""" SELECT rtb_platform_id,
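For readers hitting the same symptom, these are the settings commonly checked when a query against a partitioned Hive/Parquet table ends up listing every leaf directory (a sketch only, not necessarily the resolution in this thread, with illustrative table and predicate):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("partition-pruning-check")
    .enableHiveSupport()
    .config("spark.sql.hive.metastorePartitionPruning", "true")   // push partition predicates to the metastore
    .config("spark.sql.hive.convertMetastoreParquet", "true")     // use the built-in Parquet reader
    .config("spark.sql.hive.manageFilesourcePartitions", "true")  // metastore-managed file listing
    .getOrCreate()

  // PartitionFilters / PartitionCount in the plan should reflect the predicate
  spark.sql("SELECT * FROM my_db.my_table WHERE day = '2019-08-01' AND hour = '00'").explain(true)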

[Spark SQL] dependencies to use test helpers

2019-07-24 Thread James Pirz
https://github.com/apache/spark/blob/fced6696a7713a5dc117860faef43db6b81d07b3/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala [1]. I am just wondering which additional dependencies I need to add to my project to access them. Currently, I have the below dependencies but they do not cover the above APIs. libraryDepend
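One common way to get at those helpers is Spark's published test artifacts (the "tests" classifier) for spark-core, spark-catalyst and spark-sql; a sketch for sbt, with an illustrative version:

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"     % "2.4.4",
    "org.apache.spark" %% "spark-sql"      % "2.4.4",
    "org.apache.spark" %% "spark-core"     % "2.4.4" % Test classifier "tests",
    "org.apache.spark" %% "spark-catalyst" % "2.4.4" % Test classifier "tests",
    "org.apache.spark" %% "spark-sql"      % "2.4.4" % Test classifier "tests"
  )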

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No sorry I'm not at liberty to share other people's code. On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta < gourav.sengu...@gmail.com > wrote: > > Hi Reynold, > > > I am genuinely curious about queries which are more than 1 MB and am > stunned by tens of MB's. Any samples to share :)  > >

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Gourav Sengupta
Hi Reynold, I am genuinely curious about queries which are more than 1 MB and am stunned by tens of MB's. Any samples to share :) Regards, Gourav On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin wrote: > There is no explicit limit but a JVM string cannot be bigger than 2G. It > will also at some

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-11 Thread Reynold Xin
There is no explicit limit but a JVM string cannot be bigger than 2G. It will also at some point run out of memory with too big of a query plan tree or become incredibly slow due to query planning complexity. I've seen queries that are tens of MBs in size. On Thu, Jul 11, 2019 at 5:01 AM, 李书明

Re: Unable to run simple spark-sql

2019-06-21 Thread Raymond Honderdors
S > i.e.hdfs://xxx:8020/apps/hive/warehouse/ > For this the code ran fine. > > Thanks for the help, > -Nirmal > > From: Nirmal Kumar > Sent: 19 June 2019 11:51 > To: Raymond Honderdors > Cc: user > Subject: RE: Unable to run simple spark-sql > > Hi Raymond, &

RE: Unable to run simple spark-sql

2019-06-21 Thread Nirmal Kumar
filesystem. I created a new database and confirmed that the location was in HDFS i.e.hdfs://xxx:8020/apps/hive/warehouse/ For this the code ran fine. Thanks for the help, -Nirmal From: Nirmal Kumar Sent: 19 June 2019 11:51 To: Raymond Honderdors Cc: user Subject: RE: Unable to run simple spark-sql

Re: Spark SQL

2019-06-19 Thread naresh Goud
Just to make it more clear, Spark SQL uses the Hive metastore and runs queries using its own engine; it does not use the Hive execution engine. Please correct me if that's not true. On Mon, Jun 10, 2019 at 2:29 PM Russell Spitzer wrote: > Spark can use the HiveMetastore as a catalog, but it doesn't

RE: Unable to run simple spark-sql

2019-06-19 Thread Nirmal Kumar
directory of hive user (/home/hive/). Why is it referring the local file system and from where? Thanks, Nirmal From: Raymond Honderdors Sent: 19 June 2019 11:18 To: Nirmal Kumar Cc: user Subject: Re: Unable to run simple spark-sql Hi Nirmal, i came across the following article "

Re: Unable to run simple spark-sql

2019-06-18 Thread Raymond Honderdors
019 5:56:06 PM > To: Raymond Honderdors; Nirmal Kumar > Cc: user > Subject: RE: Unable to run simple spark-sql > > Hi Raymond, > > Permission on hdfs is 777 > drwxrwxrwx - impadmin hdfs 0 2019-06-13 16:09 > /home/hive/spark-warehouse > > > But it’

Re: Unable to run simple spark-sql

2019-06-18 Thread Nirmal Kumar
for Android<https://aka.ms/ghei36> From: Nirmal Kumar Sent: Tuesday, June 18, 2019 5:56:06 PM To: Raymond Honderdors; Nirmal Kumar Cc: user Subject: RE: Unable to run simple spark-sql Hi Raymond, Permission on hdfs is 777 drwxrwxrwx - impadmin hdfs 0 2

RE: Unable to run simple spark-sql

2019-06-18 Thread Nirmal Kumar
-warehouse/testdb.db/employee_orc/.hive-staging_hive_2019-06-18_16-08-21_448_1691186175028734135-1' Thanks, -Nirmal From: Raymond Honderdors Sent: 18 June 2019 17:52 To: Nirmal Kumar Cc: user Subject: Re: Unable to run simple spark-sql Hi Can you check the permission of the user running spark O

Re: Unable to run simple spark-sql

2019-06-18 Thread Raymond Honderdors
ckManager - Put > block broadcast_0_piece0 locally took 4 ms > 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Putting > block broadcast_0_piece0 without replication took 4 ms > 16:08:21.326 [main] INFO org.apache.spark.SparkContext - Created broadcast > 0 from sql at SparkSQLTest.ja

Unable to run simple spark-sql

2019-06-18 Thread Nirmal Kumar
lock broadcast_0_piece0 locally took 4 ms 16:08:21.323 [main] DEBUG org.apache.spark.storage.BlockManager - Putting block broadcast_0_piece0 without replication took 4 ms 16:08:21.326 [main] INFO org.apache.spark.SparkContext - Created broadcast 0 from sql at SparkSQLTest.java:33 16:08:21.449 [main] DEBUG

Re: Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3

2019-06-11 Thread rmartine
Hi folks, Does anyone know what is happening in this case? I tried both with MySQL and PostgreSQL and neither of them finishes schema creation without errors. It seems something has changed from 2.2 to 2.4 that broke schema generation for the Hive Metastore. -- Sent from:

Re: Spark SQL

2019-06-10 Thread Russell Spitzer
eam, > > Is Spark Sql uses hive engine to run queries ? > My understanding that spark sql uses hive meta store to get metadata > information to run queries. > > Thank you, > Naresh > -- > Thanks, > Naresh > www.linkedin.com/in/naresh-dulam > http://hadoopandspark.blogspot.com/ > >

Spark SQL

2019-06-10 Thread naresh Goud
Hi Team, Does Spark SQL use the Hive engine to run queries? My understanding is that Spark SQL uses the Hive metastore to get metadata information to run queries. Thank you, Naresh -- Thanks, Naresh www.linkedin.com/in/naresh-dulam http://hadoopandspark.blogspot.com/

Re: Spark SQL in R?

2019-06-08 Thread Felix Cheung
. From: ya Sent: Friday, June 7, 2019 8:26:27 PM To: Rishikesh Gawade; felixcheun...@hotmail.com; user@spark.apache.org Subject: Spark SQL in R? Dear Felix and Richikesh and list, Thank you very much for your previous help. So far I have tried two ways to trigger

Spark SQL in R?

2019-06-07 Thread ya
Dear Felix and Richikesh and list, Thank you very much for your previous help. So far I have tried two ways to trigger Spark SQL: one is to use R with sparklyr library and SparkR library; the other way is to use SparkR shell from Spark. I am not connecting a remote spark cluster, but a local

[SQL] Why casting string column to timestamp gives null?

2019-06-07 Thread Jacek Laskowski
---+ |1970-01-01 01:00:01| |1970-01-01 01:00:02| +-------+ Regards, Jacek Laskowski https://about.me/JacekLaskowski The Internals of Spark SQL https://bit.ly/spark-sql-internals The Internals of Spark Structured Streaming https://bit.ly/spark-structured-streaming The Internals of Apache Kafka https://bit.ly/apache-kafka-internals Follow me at https://twitter.com/jaceklaskowski
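For readers who land here with the same symptom, a generic sketch (not necessarily the exact case above): CAST to TIMESTAMP yields null when the string is not in a format Spark recognizes, and to_timestamp with an explicit pattern is the usual fix; the values and pattern are illustrative:

  import org.apache.spark.sql.functions.{col, to_timestamp}
  import spark.implicits._   // assumes an existing SparkSession `spark`

  val df = Seq("25/12/2019 13:45:00").toDF("ts_string")
  df.selectExpr("CAST(ts_string AS TIMESTAMP)").show()                      // null: format not recognized
  df.select(to_timestamp(col("ts_string"), "dd/MM/yyyy HH:mm:ss")).show()   // parses correctly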

Fwd: [Spark SQL Thrift Server] Persistence errors with PostgreSQL and MySQL in 2.4.3

2019-06-06 Thread Ricardo Martinelli de Oliveira
Hello, I'm running the Thrift server with PostgreSQL persistence for the Hive metastore. I'm using Postgres 9.6 and Spark 2.4.3 in this environment. When I start the Thrift server I get lots of errors while creating the schema, and it happens every time I reach Postgres, like: 19/06/06 15:51:59 WARN

Re: Spark Thriftserver on yarn, sql submit take long time.

2019-06-04 Thread Jun Zhu
case without explain also takes a long time to submit 19/06/04 05:56:37 DEBUG SparkSQLOperationManager: Created Operation for > select count(*) from perf_as_reportads with > session=org.apache.hive.service.cli.session.HiveSessionImpl@1f30fc84, > runInBackground=true > 19/06/04 05:56:37 INFO

Spark Thriftserver on yarn, sql submit take long time.

2019-06-04 Thread Jun Zhu
Hi, Running the Thrift server on YARN. It's fast when the Beeline client sends a query to the Thrift server, but it takes a while (about 90s) to submit to the YARN cluster. From the Thrift server log: > *19/06/04 05:48:27* DEBUG SparkSQLOperationManager: Created Operation for > explain select count(*) from

Re: Does Spark SQL has match_recognize?

2019-05-29 Thread kant kodali
Nope Not at all On Sun, May 26, 2019 at 8:15 AM yeikel valdes wrote: > Isn't match_recognize just a filter? > > df.filter(predicate)? > > > On Sat, 25 May 2019 12:55:47 -0700 * kanth...@gmail.com > * wrote > > Hi All, > > Does Spark SQL has match

Re:Does Spark SQL has match_recognize?

2019-05-26 Thread yeikel valdes
Isn't match_recognize just a filter? df.filter(predicate)? On Sat, 25 May 2019 12:55:47 -0700 kanth...@gmail.com wrote Hi All, Does Spark SQL has match_recognize? I am not sure why CEP seems to be neglected I believe it is one of the most useful concepts in the Financial

Does Spark SQL has match_recognize?

2019-05-25 Thread kant kodali
Hi All, Does Spark SQL have match_recognize? I am not sure why CEP seems to be neglected; I believe it is one of the most useful concepts in financial applications! Is there a plan to support it? Thanks!

Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out https://github.com/hortonworks-spark/spark-atlas-connector/ On Wed, 15 May 2019 at 19:44, lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config abou

Re: how to get spark-sql lineage

2019-05-16 Thread Gabor Somogyi
Hi, spark.lineage.enabled is Cloudera specific and doesn't work with vanilla Spark. BR, G On Thu, May 16, 2019 at 4:44 AM lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config abou

how to get spark-sql lineage

2019-05-15 Thread lk_spark
Hi all: When I use Spark, if I run some SQL to do ETL, how can I get lineage info? I found that CDH Spark has some config about lineage: spark.lineage.enabled=true spark.lineage.log.dir=/var/log/spark2/lineage Do they also work for Apache Spark? 2019-05-16

Re: Spark sql insert hive table which method has the highest performance

2019-05-15 Thread Jelly Young
Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
scala> sql("select * from t1").show
+---+---+
|  i|  j|
+---+---+
|  5|  6|
|  3|  4|

Spark sql insert hive table which method has the highest performance

2019-05-15 Thread 车
Hello guys, I use Spark Streaming to receive data from Kafka and need to store the data into Hive. I see the following ways to insert data into Hive on the Internet: 1. use tmp_table TmpDF=spark.createDataFrame(RDD,schema) TmpDF.createOrReplaceTempView('TmpData')
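For reference, a sketch of the two approaches mentioned most often (names are illustrative, the target Hive table is assumed to exist, and insertInto matches columns by position, as shown in the reply above):

  // 1) temp view + SQL
  tmpDF.createOrReplaceTempView("tmp_data")
  spark.sql("INSERT INTO TABLE db_name.target_table SELECT * FROM tmp_data")

  // 2) DataFrameWriter
  tmpDF.write.mode("append").insertInto("db_name.target_table")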

IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] while using spark-sql-2.4.1v to read data from oracle

2019-05-08 Thread Shyam P
Hi, I have an Oracle table whose column schema is: DATA_DATE DATE, with values like 31-MAR-02. I am trying to retrieve data from Oracle using spark-sql-2.4.1. I tried to set the JdbcOptions as below: .option("lowerBound", "2002-03-31 00:00:00"); .option

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Jacek Laskowski
Hi, I'm curious about "I found the bug code". Can you point me at it? Thanks. Regards, Jacek Laskowski https://about.me/JacekLaskowski Mastering Spark SQL https://bit.ly/mastering-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Str

Re: Spark SQL met "Block broadcast_xxx not found"

2019-05-07 Thread Xilang Yan
Ok... I am sure it is a bug of spark, I found the bug code, but the code is removed in 2.2.3, so I just upgrade spark to fix the problem. -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL JDBC teradata syntax error

2019-05-03 Thread Gourav Sengupta
> I was able to print schema if I give table name instead of sql query. > > I am getting below error if I give query(code snippet from above link). > any help is appreciated? > > Exception in thread "main" java.sql.SQLException: [Teradata Database] > [TeraJDBC 16.

Spark SQL JDBC teradata syntax error

2019-05-03 Thread KhajaAsmath Mohammed
Hi, I have followed the link https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/77824 to connect to Teradata from Spark. I was able to print the schema if I give a table name instead of a SQL query. I am getting the below error if I give a query (code
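A common cause of that syntax error is passing a bare query to dbtable; wrapping it as an aliased derived table (or using the separate query option on Spark 2.4+) usually resolves it. A sketch with illustrative connection details:

  val df = spark.read.format("jdbc")
    .option("url", "jdbc:teradata://host/DATABASE=mydb")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "(SELECT col1, col2 FROM mydb.my_table WHERE col1 > 0) t")  // note the alias
    .option("user", "user")
    .option("password", "password")
    .load()

  df.printSchema()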

Re: Spark SQL Teradata load is very slow

2019-05-03 Thread Shyam P
d at spark code side ? Thanks, Shyam On Thu, May 2, 2019 at 11:30 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I have teradata table who has more than 2.5 billion records and data size > is around 600 GB. I am not able to pull efficiently using spark

Spark SQL Teradata load is very slow

2019-05-02 Thread KhajaAsmath Mohammed
Hi, I have a Teradata table that has more than 2.5 billion records and the data size is around 600 GB. I am not able to pull it efficiently using Spark SQL and it has been running for more than 11 hours. Here is my code. val df2 = sparkSession.read.format("jdbc") .option("url", &
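A single JDBC connection will read a table like this serially; the usual first step is to partition the read so it runs as many parallel queries. A sketch with illustrative column, bounds and counts:

  val df2 = sparkSession.read.format("jdbc")
    .option("url", "jdbc:teradata://host/DATABASE=mydb")
    .option("driver", "com.teradata.jdbc.TeraDriver")
    .option("dbtable", "mydb.big_table")
    .option("user", "user")
    .option("password", "password")
    .option("partitionColumn", "id")         // a numeric, date or timestamp column
    .option("lowerBound", "1")
    .option("upperBound", "2500000000")
    .option("numPartitions", "200")          // 200 parallel queries, each reading a slice of id
    .load()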
