Re: A question about RDD bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored.
For example, if we cache the DataFrame, then the bytes size is the number
of bytes of the binary format of the table cache. If we write to Hive
tables, then the bytes size is the total size of the data files of the
table.
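
For example, here is a minimal sketch of both measurements. The table name
size_demo is made up, and the exact statistics output varies by Spark version:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("size-demo")
  .enableHiveSupport()
  .getOrCreate()
val df = spark.range(0, 1000000).toDF("id")

// 1. Bytes of the in-memory binary format of the table cache:
df.cache().count() // materialize the cache first
println(s"cached bytes: ${df.queryExecution.optimizedPlan.stats.sizeInBytes}")

// 2. Total size of the data files of the table, which is what
//    spark.sql.statistics.totalSize reflects:
df.write.mode("overwrite").saveAsTable("size_demo")
spark.sql("ANALYZE TABLE size_demo COMPUTE STATISTICS")
spark.sql("DESCRIBE TABLE EXTENDED size_demo")
  .filter("col_name = 'Statistics'")
  .show(truncate = false)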

On Mon, Dec 2, 2019 at 1:06 PM zhangliyun  wrote:

> Hi:
>
> I want to get the total bytes of a DataFrame with the following function,
> but when I insert the DataFrame into Hive, I find that the value returned
> by the function differs from spark.sql.statistics.totalSize. The
> spark.sql.statistics.totalSize is less than the result of the getRDDBytes
> function below.
>
> def getRDDBytes(df: DataFrame): Long = {
>   df.rdd.getNumPartitions match {
>     case 0 =>
>       0L
>     case numPartitions =>
>       val rddOfDataframe =
>         df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
>       if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
>   }
> }
> I would appreciate any suggestions.
>
> Best Regards
> Kelly Zhang


ScaledML 2020 Spark Speakers and Promo

2019-12-01 Thread Reza Zadeh
Spark Users,

You are all welcome to join us at ScaledML 2020: http://scaledml.org

A very steep discount is available for this list, using this link.

We'd love to see you there.

Best,
Reza


connectivity

2019-12-01 Thread Krishna Chandran Nair
Hi Team,

Can anyone provide sample code for connecting to ADLS from Spark using Azure
Key Vault (user-managed key)?
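
For reference, here is a minimal sketch of the standard hadoop-azure (ABFS)
OAuth configuration for ADLS Gen2, assuming a service principal whose client
secret is stored in Azure Key Vault and has already been retrieved out of
band (for example via the Azure CLI or SDK, or a platform secret scope). All
account, tenant, container, and path names below are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("adls-demo").getOrCreate()

val account      = "mystorageaccount"              // placeholder
val clientId     = "<service-principal-client-id>" // placeholder
val tenantId     = "<tenant-id>"                   // placeholder
// The secret itself should come from Key Vault, not be hard-coded; here we
// assume it was fetched beforehand and exposed via an environment variable.
val clientSecret = sys.env("ADLS_CLIENT_SECRET")

spark.conf.set(s"fs.azure.account.auth.type.$account.dfs.core.windows.net", "OAuth")
spark.conf.set(s"fs.azure.account.oauth.provider.type.$account.dfs.core.windows.net",
  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(s"fs.azure.account.oauth2.client.id.$account.dfs.core.windows.net", clientId)
spark.conf.set(s"fs.azure.account.oauth2.client.secret.$account.dfs.core.windows.net", clientSecret)
spark.conf.set(s"fs.azure.account.oauth2.client.endpoint.$account.dfs.core.windows.net",
  s"https://login.microsoftonline.com/$tenantId/oauth2/token")

// Verify connectivity by reading something (container and path are placeholders):
spark.read.parquet(s"abfss://mycontainer@$account.dfs.core.windows.net/path/to/data").show()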




A question about RDD bytes size

2019-12-01 Thread zhangliyun
Hi:


I want to get the total bytes of a DataFrame with the following function, but
when I insert the DataFrame into Hive, I find that the value returned by the
function differs from spark.sql.statistics.totalSize. The
spark.sql.statistics.totalSize is less than the result of the getRDDBytes
function below.

def getRDDBytes(df: DataFrame): Long = {
  df.rdd.getNumPartitions match {
    case 0 =>
      0L
    case numPartitions =>
      val rddOfDataframe =
        df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
      if (rddOfDataframe.isEmpty()) 0L else rddOfDataframe.reduce(_ + _)
  }
}
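
For reference, the on-disk size that spark.sql.statistics.totalSize tracks
can also be measured directly from the table's data files. A minimal sketch,
using the internal (unstable) session catalog API; the table name my_table
is hypothetical:

import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.TableIdentifier

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()

// Locate the table's storage directory from the session catalog.
val tableMeta = spark.sessionState.catalog.getTableMetadata(TableIdentifier("my_table"))
val location  = new Path(tableMeta.location)

// Sum the bytes of the files actually on disk; this is what totalSize tracks.
val fs = location.getFileSystem(spark.sparkContext.hadoopConfiguration)
println(s"bytes on disk: ${fs.getContentSummary(location).getLength}")
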
I would appreciate any suggestions.


Best Regards
Kelly Zhang



Re: [Spark SQL]: Is a namespace name always needed in a query for tables from a user-defined catalog plugin?

2019-12-01 Thread xufei
Thanks, Terry. Glad to know that it is not the expected behavior.

On Mon, Dec 2, 2019 at 11:51 AM, Terry Kim wrote:

> Hi Xufei,
> I also noticed the same while looking into relation resolution behavior
> (see Appendix A in this doc). I created SPARK-30094 and will follow up.
>
> Thanks,
> Terry
>
> On Sun, Dec 1, 2019 at 7:12 PM xufei  wrote:
>
>> Hi,
>>
>> I'm trying to write a catalog plugin based on spark-3.0-preview, and I
>> found that even when I use 'USE catalog.namespace' to set the current
>> catalog and namespace, I still need to use qualified names in queries.
>>
>> For example, I add a catalog named 'example_catalog' containing a database
>> named 'test', with a table 't' in 'example_catalog.test'. I can query the
>> table using 'select * from example_catalog.test.t' under the default
>> catalog (which is spark_catalog). After I run 'use example_catalog.test'
>> to change the current catalog to 'example_catalog' and the current
>> namespace to 'test', I can query the table using 'select * from test.t',
>> but 'select * from t' fails with a table-not-found exception.
>>
>> Is this expected behavior? If so, it seems a little odd, since I would
>> expect that after 'use example_catalog.test', all unqualified identifiers
>> would be interpreted as 'example_catalog.test.identifier'.
>>
>> Attached is a test file that you can use to reproduce the problem.
>>
>> Thanks.
>>


Re: [Spark SQL]: Is a namespace name always needed in a query for tables from a user-defined catalog plugin?

2019-12-01 Thread Terry Kim
Hi Xufei,
I also noticed the same while looking into relation resolution behavior
(see Appendix A in this doc). I created SPARK-30094 and will follow up.

Thanks,
Terry

On Sun, Dec 1, 2019 at 7:12 PM xufei  wrote:

> Hi,
>
> I'm trying to write a catalog plugin based on spark-3.0-preview, and I
> found that even when I use 'USE catalog.namespace' to set the current
> catalog and namespace, I still need to use qualified names in queries.
>
> For example, I add a catalog named 'example_catalog' containing a database
> named 'test', with a table 't' in 'example_catalog.test'. I can query the
> table using 'select * from example_catalog.test.t' under the default
> catalog (which is spark_catalog). After I run 'use example_catalog.test'
> to change the current catalog to 'example_catalog' and the current
> namespace to 'test', I can query the table using 'select * from test.t',
> but 'select * from t' fails with a table-not-found exception.
>
> Is this expected behavior? If so, it seems a little odd, since I would
> expect that after 'use example_catalog.test', all unqualified identifiers
> would be interpreted as 'example_catalog.test.identifier'.
>
> Attached is a test file that you can use to reproduce the problem.
>
> Thanks.
>


[Spark SQL]: Is a namespace name always needed in a query for tables from a user-defined catalog plugin?

2019-12-01 Thread xufei
Hi,

I'm trying to write a catalog plugin based on spark-3.0-preview, and I found
that even when I use 'USE catalog.namespace' to set the current catalog and
namespace, I still need to use qualified names in queries.

For example, I add a catalog named 'example_catalog' containing a database
named 'test', with a table 't' in 'example_catalog.test'. I can query the
table using 'select * from example_catalog.test.t' under the default catalog
(which is spark_catalog). After I run 'use example_catalog.test' to change
the current catalog to 'example_catalog' and the current namespace to
'test', I can query the table using 'select * from test.t', but 'select *
from t' fails with a table-not-found exception.

Is this expected behavior? If so, it seems a little odd, since I would
expect that after 'use example_catalog.test', all unqualified identifiers
would be interpreted as 'example_catalog.test.identifier'.
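
A minimal sketch of the repro, for reference (the catalog implementation
class com.example.MyCatalog is hypothetical; the catalog is registered under
the documented spark.sql.catalog.<name> configuration key):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("catalog-repro")
  .config("spark.sql.catalog.example_catalog", "com.example.MyCatalog") // hypothetical plugin class
  .getOrCreate()

spark.sql("USE example_catalog.test")

// These resolve as expected:
spark.sql("SELECT * FROM example_catalog.test.t").show()
spark.sql("SELECT * FROM test.t").show()

// This fails with a table-not-found error on spark-3.0-preview, even though
// the current catalog and namespace are example_catalog.test:
spark.sql("SELECT * FROM t").show()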

Attached is a test file that you can use to reproduce the problem.

Thanks.


[Attachment: DataSourceV2ExplainSuite.scala]
