SQL GROUP BY alias with dots, was: Spark SQL question

2023-02-07 Thread Enrico Minack

Hi,

you are right, that is an interesting question.

Looks like GROUP BY is doing something funny / magic here (spark-shell 
3.3.1 and 3.5.0-SNAPSHOT):


With an alias, it behaves as you have pointed out:

spark.range(3).createTempView("ids_without_dots")
spark.sql("SELECT * FROM ids_without_dots").show()

// works
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
an.id").show()

// fails
spark.sql("SELECT id AS `an.id` FROM ids_without_dots GROUP BY 
`an.id`").show()



Without an alias (a column with a dot in its name exists and no alias is used
in the SELECT), it behaves as expected, which is the opposite of the above:


spark.range(3).select($"id".as("an.id")).createTempView("ids_with_dots")
spark.sql("SELECT `an.id` FROM ids_with_dots").show()

// works
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY `an.id`").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_dots GROUP BY an.id").show()


With a struct column, it also behaves as expected:

spark.range(3).select(struct($"id").as("an")).createTempView("ids_with_struct")
spark.sql("SELECT an.id FROM ids_with_struct").show()

// works
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY an.id").show()
// fails
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY an.id").show()
spark.sql("SELECT an.id FROM ids_with_struct GROUP BY `an.id`").show()
spark.sql("SELECT `an.id` FROM ids_with_struct GROUP BY `an.id`").show()


This does not feel very consistent.
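
For completeness: the DataFrame API side-steps the SQL identifier parsing
entirely. A small sketch (assuming the ids_with_dots view from above; the
backticks inside the column-name string escape the dot):

val df = spark.table("ids_with_dots")
// the backticks inside the string refer to the column literally named "an.id"
df.groupBy(col("`an.id`")).count().show()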

Enrico



Am 28.01.23 um 00:34 schrieb Kohki Nishio:

this SQL works

select 1 as `data.group` from tbl group by data.group


Since there's no such field as data, I thought the SQL would have to look
like this


select 1 as `data.group` from tbl group by `data.group`


But that gives an error (cannot resolve '`data.group`') ... I'm no
expert in SQL, but it feels like strange behavior... does anybody
have a good explanation for it?


Thanks

--
Kohki Nishio




Re: Spark SQL question

2023-01-28 Thread Bjørn Jørgensen
Hi Mich.
This is a Spark user group mailing list where people can ask *any*
question about Spark.
You know SQL and streaming, but I don't think it's necessary to start a
reply with "*LOL*" to the question that's being asked.
No question is too stupid to be asked.


lør. 28. jan. 2023 kl. 09:22 skrev Mich Talebzadeh <
mich.talebza...@gmail.com>:

> LOL
>
> First one
>
> spark-sql> select 1 as `data.group` from abc group by data.group;
> 1
> Time taken: 0.198 seconds, Fetched 1 row(s)
>
> means that you are assigning the alias data.group in the SELECT and using
> that alias -> data.group in your GROUP BY clause
>
>
> This is equivalent to
>
>
> spark-sql> select 1 as `data.group` from abc group by 1;
>
> 1
>
> With regard to your second sql
>
>
> select 1 as `data.group` from tbl group by `data.group`;
>
>
> will throw an error:
>
>
> spark-sql> select 1 as `data.group` from abc group by `data.group`;
>
> Error in query: cannot resolve '`data.group`' given input columns:
> [spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
> line 1 pos 43;
>
> 'Aggregate ['`data.group`], [1 AS data.group#225]
>
> +- SubqueryAlias spark_catalog.elayer.abc
>
>    +- HiveTableRelation [`elayer`.`abc`,
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
> [keyword#226, occurence#227L], Partition Cols: []]
>
> `data.group` with backticks is neither the name of a column nor its alias
>
>
> *HTH*
>
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:
>
>> this SQL works
>>
>> select 1 as `data.group` from tbl group by data.group
>>
>>
>> Since there's no such field as data, I thought the SQL would have to look
>> like this
>>
>> select 1 as `data.group` from tbl group by `data.group`
>>
>>
>>  But that gives an error (cannot resolve '`data.group`') ... I'm no
>> expert in SQL, but it feels like strange behavior... does anybody have a
>> good explanation for it?
>>
>> Thanks
>>
>> --
>> Kohki Nishio
>>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Spark SQL question

2023-01-28 Thread Mich Talebzadeh
LOL

First one

spark-sql> select 1 as `data.group` from abc group by data.group;
1
Time taken: 0.198 seconds, Fetched 1 row(s)

means that you are assigning the alias data.group in the SELECT and using that
alias -> data.group in your GROUP BY clause


This is equivalent to


spark-sql> select 1 as `data.group` from abc group by 1;

1

With regard to your second sql


select 1 as `data.group` from tbl group by `data.group`;


will throw an error:


spark-sql> select 1 as `data.group` from abc group by `data.group`;

Error in query: cannot resolve '`data.group`' given input columns:
[spark_catalog.elayer.abc.keyword, spark_catalog.elayer.abc.occurence];
line 1 pos 43;

'Aggregate ['`data.group`], [1 AS data.group#225]

+- SubqueryAlias spark_catalog.elayer.abc

   +- HiveTableRelation [`elayer`.`abc`,
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols:
[keyword#226, occurence#227L], Partition Cols: []]

`data.group` with backticks is neither the name of a column nor its alias
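
If it helps, a quick way to see what each form resolves to from the spark-shell
(a sketch; it assumes the same abc table is available, and explain(true) prints
the parsed and analyzed plans):

// works: the analyzed plan shows how the unquoted GROUP BY item gets resolved
spark.sql("SELECT 1 AS `data.group` FROM abc GROUP BY data.group").explain(true)

// the backticked form fails analysis with the "cannot resolve" error shown
// above, so spark.sql() throws before there is anything to explain:
// spark.sql("SELECT 1 AS `data.group` FROM abc GROUP BY `data.group`").explain(true)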


*HTH*



   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 27 Jan 2023 at 23:36, Kohki Nishio  wrote:

> this SQL works
>
> select 1 as `data.group` from tbl group by data.group
>
>
> Since there's no such field as data, I thought the SQL would have to look
> like this
>
> select 1 as `data.group` from tbl group by `data.group`
>
>
>  But that gives an error (cannot resolve '`data.group`') ... I'm no
> expert in SQL, but it feels like strange behavior... does anybody have a
> good explanation for it?
>
> Thanks
>
> --
> Kohki Nishio
>


Spark SQL question

2023-01-27 Thread Kohki Nishio
this SQL works

select 1 as `data.group` from tbl group by data.group


Since there's no such field as data, I thought the SQL would have to look like
this

select 1 as `data.group` from tbl group by `data.group`


 But that gives an error (cannot resolve '`data.group`') ... I'm no expert
in SQL, but it feels like strange behavior... does anybody have a good
explanation for it?

Thanks

-- 
Kohki Nishio


Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
Cool thanks. Will take a look...

Sent from my iPhone

 On Jul 13, 2015, at 6:40 PM, Michael Armbrust mich...@databricks.com wrote:
 
 I'd look at the JDBC server (a long-running YARN job you can submit queries to)
 
 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
 
 On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com 
 wrote:
 Well for adhoc queries you can use the CLI
 
 On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez 
 zlgonza...@yahoo.com.invalid wrote:
 Hi,
   I have a question for Spark SQL. Is there a way to be able to use Spark 
 SQL on YARN without having to submit a job?
   Bottom line here is I want to be able to reduce the latency of running 
 queries as a job. I know that the spark sql default submission is like a 
 job, but was wondering if it's possible to run queries like one would with 
 a regular db like MySQL or Oracle.
 
 Thanks,
 Ron
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 


Basic Spark SQL question

2015-07-13 Thread Ron Gonzalez

Hi,
  I have a question about Spark SQL. Is there a way to use Spark SQL on
YARN without having to submit a job?
  The bottom line is that I want to reduce the latency of running queries.
I know that the default Spark SQL submission runs like a job, but I was
wondering if it's possible to run queries the way one would with a regular
database like MySQL or Oracle.


Thanks,
Ron


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Basic Spark SQL question

2015-07-13 Thread Jerrick Hoang
Well for adhoc queries you can use the CLI

On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid
wrote:

 Hi,
   I have a question for Spark SQL. Is there a way to be able to use Spark
 SQL on YARN without having to submit a job?
   Bottom line here is I want to be able to reduce the latency of running
 queries as a job. I know that the spark sql default submission is like a
 job, but was wondering if it's possible to run queries like one would with
 a regular db like MySQL or Oracle.

 Thanks,
 Ron


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Basic Spark SQL question

2015-07-13 Thread Michael Armbrust
I'd look at the JDBC server (a long-running YARN job you can submit queries to)

https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
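
(For illustration only, a rough sketch of querying the Thrift server over
plain JDBC from Scala; it assumes the server was started with
sbin/start-thriftserver.sh on the default port 10000, that the Hive JDBC
driver is on the classpath, and that "some_table" is a placeholder.)

import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
try {
  // the query runs inside the already-running server application,
  // so there is no per-query application startup
  val rs = conn.createStatement().executeQuery("SELECT count(*) FROM some_table")
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}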

On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com
wrote:

 Well for adhoc queries you can use the CLI

 On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez 
 zlgonza...@yahoo.com.invalid wrote:

 Hi,
   I have a question for Spark SQL. Is there a way to be able to use Spark
 SQL on YARN without having to submit a job?
   Bottom line here is I want to be able to reduce the latency of running
 queries as a job. I know that the spark sql default submission is like a
 job, but was wondering if it's possible to run queries like one would with
 a regular db like MySQL or Oracle.

 Thanks,
 Ron


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Haopu Wang
Liquan, yes, for a full outer join, a hash table on both sides is more
efficient.

 

For a left/right outer join, it looks like one hash table should be enough.
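
To illustrate the point with a toy sketch (not Spark code): for a left outer
join it is enough to hash the build (right) side once and stream the probe
(left) side, emitting nulls where no match is found. A full outer join would
additionally have to remember which build rows were matched so that the
unmatched ones can be emitted at the end, which is why building both sides is
the simpler option there.

// toy hash-based LEFT OUTER JOIN: a single hash table on the right (build) side
def leftOuterHashJoin[K, L, R](
    left: Iterator[(K, L)],
    right: Seq[(K, R)]): Iterator[(K, (L, Option[R]))] = {
  // build phase: hash the right side once
  val buildTable: Map[K, Seq[R]] =
    right.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
  // probe phase: stream the left side and look up matches
  left.flatMap { case (k, l) =>
    buildTable.get(k) match {
      case Some(rs) => rs.iterator.map(r => (k, (l, Some(r): Option[R])))
      case None     => Iterator.single((k, (l, None: Option[R])))
    }
  }
}

// leftOuterHashJoin(Iterator(1 -> "a", 2 -> "b"), Seq(1 -> "x")).toList
// => List((1,(a,Some(x))), (2,(b,None)))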

 



From: Liquan Pei [mailto:liquan...@gmail.com] 
Sent: 30 September 2014 18:34
To: Haopu Wang
Cc: d...@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in 
HashOuterJoin?

 

Hi Haopu,

 

How about full outer join? One hash table may not be efficient for this case. 

 

Liquan

 

On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:

Hi, Liquan, thanks for the response.

 

In your example, I think the hash table should be built on the right side, so 
Spark can iterate through the left side and find matches in the right side from 
the hash table efficiently. Please comment and suggest, thanks again!

 



From: Liquan Pei [mailto:liquan...@gmail.com] 
Sent: 30 September 2014 12:31
To: Haopu Wang
Cc: d...@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in 
HashOuterJoin?

 

Hi Haopu,

 

My understanding is that the hash table on both the left and right side is used
for including null values in the result in an efficient manner. If the hash
table is only built on one side, let's say the left side, and we perform a left
outer join, then for each row on the left side a scan over the right side is
needed to make sure that there are no matching tuples for that row.

 

Hope this helps!

Liquan

 

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:

I take a look at HashOuterJoin and it's building a Hashtable for both
sides.

This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration on streamed relation, right?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





 

-- 
Liquan Pei 
Department of Physics 
University of Massachusetts Amherst 





 

-- 
Liquan Pei 
Department of Physics 
University of Massachusetts Amherst 



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. 
Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer 
joins do both, and it seems like we could optimize it for those that are not 
full.

Matei


On Oct 7, 2014, at 11:04 PM, Haopu Wang hw...@qilinsoft.com wrote:

 Liquan, yes, for full outer join, one hash table on both sides is more 
 efficient.
  
  For the left/right outer join, it looks like one hash table should be enough.
  
 From: Liquan Pei [mailto:liquan...@gmail.com] 
 Sent: 30 September 2014 18:34
 To: Haopu Wang
 Cc: d...@spark.apache.org; user
 Subject: Re: Spark SQL question: why build hashtable for both sides in 
 HashOuterJoin?
  
 Hi Haopu,
  
 How about full outer join? One hash table may not be efficient for this case. 
  
 Liquan
  
 On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:
 Hi, Liquan, thanks for the response.
  
 In your example, I think the hash table should be built on the right side, 
 so Spark can iterate through the left side and find matches in the right side 
 from the hash table efficiently. Please comment and suggest, thanks again!
  
 From: Liquan Pei [mailto:liquan...@gmail.com] 
 Sent: 30 September 2014 12:31
 To: Haopu Wang
 Cc: d...@spark.apache.org; user
 Subject: Re: Spark SQL question: why build hashtable for both sides in 
 HashOuterJoin?
  
 Hi Haopu,
  
 My understanding is that the hashtable on both left and right side is used 
 for including null values in result in an efficient manner. If hash table is 
 only built on one side, let's say left side and we perform a left outer join, 
 for each row in left side, a scan over the right side is needed to make sure 
 that no matching tuples for that row on left side. 
  
 Hope this helps!
 Liquan
  
 On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:
 I take a look at HashOuterJoin and it's building a Hashtable for both
 sides.
 
 This consumes quite a lot of memory when the partition is big. And it
 doesn't reduce the iteration on streamed relation, right?
 
 Thanks!
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
  
 -- 
 Liquan Pei 
 Department of Physics 
 University of Massachusetts Amherst
 
 
  
 -- 
 Liquan Pei 
 Department of Physics 
 University of Massachusetts Amherst



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
I am working on a PR to leverage the HashJoin trait code to optimize the
left/right outer join. It has already been tested locally, and I will send out
the PR soon after some cleanup.

Thanks,
Liquan

On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 I'm pretty sure inner joins on Spark SQL already build only one of the
 sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators.
 Only outer joins do both, and it seems like we could optimize it for those
 that are not full.

 Matei



 On Oct 7, 2014, at 11:04 PM, Haopu Wang hw...@qilinsoft.com wrote:

 Liquan, yes, for full outer join, one hash table on both sides is more
 efficient.

 For the left/right outer join, it looks like one hash table should be
  enough.

 --
 *From:* Liquan Pei [mailto:liquan...@gmail.com liquan...@gmail.com]
 *Sent:* 30 September 2014 18:34
 *To:* Haopu Wang
 *Cc:* d...@spark.apache.org; user
 *Subject:* Re: Spark SQL question: why build hashtable for both sides in
 HashOuterJoin?

 Hi Haopu,

 How about full outer join? One hash table may not be efficient for this
 case.

 Liquan

 On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:
 Hi, Liquan, thanks for the response.

 In your example, I think the hash table should be built on the right
 side, so Spark can iterate through the left side and find matches in the
 right side from the hash table efficiently. Please comment and suggest,
 thanks again!

 --
 *From:* Liquan Pei [mailto:liquan...@gmail.com]
 *Sent:* 30 September 2014 12:31
 *To:* Haopu Wang
 *Cc:* d...@spark.apache.org; user
 *Subject:* Re: Spark SQL question: why build hashtable for both sides in
 HashOuterJoin?

 Hi Haopu,

 My understanding is that the hashtable on both left and right side is used
 for including null values in result in an efficient manner. If hash table
 is only built on one side, let's say left side and we perform a left outer
 join, for each row in left side, a scan over the right side is needed to
 make sure that no matching tuples for that row on left side.

 Hope this helps!
 Liquan

 On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:

 I take a look at HashOuterJoin and it's building a Hashtable for both
 sides.

 This consumes quite a lot of memory when the partition is big. And it
 doesn't reduce the iteration on streamed relation, right?

 Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst



 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst





-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Haopu Wang
Hi, Liquan, thanks for the response.

 

In your example, I think the hash table should be built on the right side, so 
Spark can iterate through the left side and find matches in the right side from 
the hash table efficiently. Please comment and suggest, thanks again!

 



From: Liquan Pei [mailto:liquan...@gmail.com] 
Sent: 30 September 2014 12:31
To: Haopu Wang
Cc: d...@spark.apache.org; user
Subject: Re: Spark SQL question: why build hashtable for both sides in 
HashOuterJoin?

 

Hi Haopu,

 

My understanding is that the hashtable on both left and right side is used for 
including null values in result in an efficient manner. If hash table is only 
built on one side, let's say left side and we perform a left outer join, for 
each row in left side, a scan over the right side is needed to make sure that 
no matching tuples for that row on left side. 

 

Hope this helps!

Liquan

 

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:

I take a look at HashOuterJoin and it's building a Hashtable for both
sides.

This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration on streamed relation, right?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





 

-- 
Liquan Pei 
Department of Physics 
University of Massachusetts Amherst 



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-30 Thread Liquan Pei
Hi Haopu,

How about full outer join? One hash table may not be efficient for this
case.

Liquan

On Mon, Sep 29, 2014 at 11:47 PM, Haopu Wang hw...@qilinsoft.com wrote:

Hi, Liquan, thanks for the response.



 In your example, I think the hash table should be built on the right
 side, so Spark can iterate through the left side and find matches in the
 right side from the hash table efficiently. Please comment and suggest,
 thanks again!


  --

 *From:* Liquan Pei [mailto:liquan...@gmail.com]
 *Sent:* 30 September 2014 12:31
 *To:* Haopu Wang
 *Cc:* d...@spark.apache.org; user
 *Subject:* Re: Spark SQL question: why build hashtable for both sides in
 HashOuterJoin?



 Hi Haopu,



 My understanding is that the hashtable on both left and right side is used
 for including null values in result in an efficient manner. If hash table
 is only built on one side, let's say left side and we perform a left outer
 join, for each row in left side, a scan over the right side is needed to
 make sure that no matching tuples for that row on left side.



 Hope this helps!

 Liquan



 On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:

 I take a look at HashOuterJoin and it's building a Hashtable for both
 sides.

 This consumes quite a lot of memory when the partition is big. And it
 doesn't reduce the iteration on streamed relation, right?

 Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





 --
 Liquan Pei
 Department of Physics
 University of Massachusetts Amherst




-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both
sides.

This consumes quite a lot of memory when the partition is big. And it
doesn't reduce the iteration over the streamed relation, right?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Liquan Pei
Hi Haopu,

My understanding is that the hash table on both the left and right side is used
for including null values in the result in an efficient manner. If the hash
table is only built on one side, let's say the left side, and we perform a left
outer join, then for each row on the left side a scan over the right side is
needed to make sure that there are no matching tuples for that row.

Hope this helps!
Liquan

On Mon, Sep 29, 2014 at 8:36 PM, Haopu Wang hw...@qilinsoft.com wrote:

 I take a look at HashOuterJoin and it's building a Hashtable for both
 sides.

 This consumes quite a lot of memory when the partition is big. And it
 doesn't reduce the iteration on streamed relation, right?

 Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst


Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Haopu Wang
Thanks for the response. From the Spark Web UI's Storage tab, I do see the
cached RDD there.

 

But the storage level is Memory Deserialized 1x Replicated. How can I change 
the storage level? Because I have a big table there.

 

Thanks!

 



From: Cheng Lian [mailto:lian.cs@gmail.com] 
Sent: 26 September 2014 21:24
To: Haopu Wang; user@spark.apache.org
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by 
spark.storage.memoryFraction?

 

Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache() 
under the hood.

On 9/26/14 4:04 PM, Haopu Wang wrote:

Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.
 
The question is: the schemaRDD has been cached with cacheTable()
function. So is the cached schemaRDD part of memory storage controlled
by the spark.storage.memoryFraction parameter?
 
Thanks!
 
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
 




Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
This is not possible until https://github.com/apache/spark/pull/2501 is
merged.

On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang hw...@qilinsoft.com wrote:

   Thanks for the response. From Spark Web-UI's Storage tab, I do see
 cached RDD there.



 But the storage level is Memory Deserialized 1x Replicated. How can I
 change the storage level? Because I have a big table there.



 Thanks!


  --

 *From:* Cheng Lian [mailto:lian.cs@gmail.com]
 *Sent:* 26 September 2014 21:24
 *To:* Haopu Wang; user@spark.apache.org
 *Subject:* Re: Spark SQL question: is cached SchemaRDD storage controlled
 by spark.storage.memoryFraction?



 Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache()
 under the hood.

 On 9/26/14 4:04 PM, Haopu Wang wrote:

 Hi, I'm querying a big table using Spark SQL. I see very long GC time in

 some stages. I wonder if I can improve it by tuning the storage

 parameter.



 The question is: the schemaRDD has been cached with cacheTable()

 function. So is the cached schemaRDD part of memory storage controlled

 by the spark.storage.memoryFraction parameter?



 Thanks!



 -

 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

 For additional commands, e-mail: user-h...@spark.apache.org






Re: Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Michael Armbrust
You might consider instead storing the data using saveAsParquetFile and
then querying that after running
sqlContext.parquetFile(...).registerTempTable(...).
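
(A sketch of that pattern with the Spark 1.x API; the table name and path
below are placeholders.)

// write the data out as Parquet once...
val bigTable = sqlContext.sql("SELECT * FROM some_big_table")
bigTable.saveAsParquetFile("/tmp/big_table.parquet")

// ...then register the Parquet file as a table and query that instead
sqlContext.parquetFile("/tmp/big_table.parquet").registerTempTable("big_table_pq")
sqlContext.sql("SELECT count(*) FROM big_table_pq").collect()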

On Sun, Sep 28, 2014 at 6:43 PM, Michael Armbrust mich...@databricks.com
wrote:

 This is not possible until https://github.com/apache/spark/pull/2501 is
 merged.

 On Sun, Sep 28, 2014 at 6:39 PM, Haopu Wang hw...@qilinsoft.com wrote:

   Thanks for the response. From Spark Web-UI's Storage tab, I do see
 cached RDD there.



 But the storage level is Memory Deserialized 1x Replicated. How can I
 change the storage level? Because I have a big table there.



 Thanks!


  --

 *From:* Cheng Lian [mailto:lian.cs@gmail.com]
 *Sent:* 26 September 2014 21:24
 *To:* Haopu Wang; user@spark.apache.org
 *Subject:* Re: Spark SQL question: is cached SchemaRDD storage
 controlled by spark.storage.memoryFraction?



 Yes it is. The in-memory storage used with SchemaRDD also uses
 RDD.cache() under the hood.

 On 9/26/14 4:04 PM, Haopu Wang wrote:

 Hi, I'm querying a big table using Spark SQL. I see very long GC time in

 some stages. I wonder if I can improve it by tuning the storage

 parameter.



 The question is: the schemaRDD has been cached with cacheTable()

 function. So is the cached schemaRDD part of memory storage controlled

 by the spark.storage.memoryFraction parameter?



 Thanks!



 -

 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org

 For additional commands, e-mail: user-h...@spark.apache.org








Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Haopu Wang
Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.

The question is: the schemaRDD has been cached with cacheTable()
function. So is the cached schemaRDD part of memory storage controlled
by the spark.storage.memoryFraction parameter?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Cheng Lian
Yes it is. The in-memory storage used with SchemaRDD also uses
RDD.cache() under the hood.
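
For example (a sketch; the table name is a placeholder):

sqlContext.cacheTable("big_table")    // marks the table for in-memory columnar caching
sqlContext.sql("SELECT count(*) FROM big_table").collect()   // first scan populates the cache
sqlContext.uncacheTable("big_table")  // drops the cached data again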


On 9/26/14 4:04 PM, Haopu Wang wrote:


Hi, I'm querying a big table using Spark SQL. I see very long GC time in
some stages. I wonder if I can improve it by tuning the storage
parameter.

The question is: the schemaRDD has been cached with cacheTable()
function. So is the cached schemaRDD part of memory storage controlled
by the spark.storage.memoryFraction parameter?

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Fwd: Spark SQL question: is cached SchemaRDD storage controlled by spark.storage.memoryFraction?

2014-09-26 Thread Liquan Pei
-- Forwarded message --
From: Liquan Pei liquan...@gmail.com
Date: Fri, Sep 26, 2014 at 1:33 AM
Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by
spark.storage.memoryFraction?
To: Haopu Wang hw...@qilinsoft.com


Hi Haopu,

Internally, cacheTable on a schemaRDD is implemented as a cache() on a
MapPartitionsRDD. Since the memory reserved for caching RDDs is controlled by
spark.storage.memoryFraction, the storage of a cached schemaRDD is controlled
by spark.storage.memoryFraction as well.
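
(A configuration sketch for the Spark 1.x era; the fraction and table name are
only examples.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// the heap share reserved for cached RDDs (and therefore for cached schemaRDDs);
// the default for spark.storage.memoryFraction is 0.6
val conf = new SparkConf()
  .setAppName("cached-table-example")
  .set("spark.storage.memoryFraction", "0.7")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

sqlContext.cacheTable("big_table")   // cached columnar data counts against that fraction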

Hope this helps!
Liquan

On Fri, Sep 26, 2014 at 1:04 AM, Haopu Wang hw...@qilinsoft.com wrote:

 Hi, I'm querying a big table using Spark SQL. I see very long GC time in
 some stages. I wonder if I can improve it by tuning the storage
 parameter.

 The question is: the schemaRDD has been cached with cacheTable()
 function. So is the cached schemaRDD part of memory storage controlled
 by the spark.storage.memoryFraction parameter?

 Thanks!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst



-- 
Liquan Pei
Department of Physics
University of Massachusetts Amherst