Re: A question about rdd bytes size

2019-12-01 Thread Wenchen Fan
When we talk about byte size, we need to specify how the data is stored.
For example, if we cache the DataFrame, then the byte size is the number
of bytes of the binary format of the table cache. If we write to Hive
tables, then the byte size is the total size of the table's data files.
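
As a hedged illustration of that point (a minimal sketch assuming Spark
2.4-era APIs; the output path is hypothetical), the same DataFrame yields
very different numbers depending on which "size" you measure:

  val df = spark.range(0, 1000000).selectExpr("id", "cast(id as string) as s")

  // 1. Rows rendered as UTF-8 strings (what the getRDDBytes function quoted
  //    below measures).
  val stringBytes =
    df.rdd.map(_.toString().getBytes("UTF-8").length.toLong).reduce(_ + _)

  // 2. The optimizer's size estimate for the logical plan.
  val planEstimate = df.queryExecution.optimizedPlan.stats.sizeInBytes

  // 3. The size of the files actually written, which is what a Hive table's
  //    spark.sql.statistics.totalSize reflects (compressed, columnar files).
  df.write.mode("overwrite").parquet("/tmp/size_demo")

  println(s"string bytes: $stringBytes, plan estimate: $planEstimate")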

On Mon, Dec 2, 2019 at 1:06 PM zhangliyun  wrote:

> Hi:
>
> I want to get the total size in bytes of a DataFrame with the following
> function, but when I insert the DataFrame into a Hive table, the value
> returned by the function differs from spark.sql.statistics.totalSize. The
> spark.sql.statistics.totalSize is smaller than the result of the
> getRDDBytes function below.
>
>   def getRDDBytes(df: DataFrame): Long = {
>     df.rdd.getNumPartitions match {
>       case 0 =>
>         0L
>       case numPartitions =>
>         val rddOfDataframe =
>           df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
>         val size = if (rddOfDataframe.isEmpty()) {
>           0L
>         } else {
>           rddOfDataframe.reduce(_ + _)
>         }
>         size
>     }
>   }
> I would appreciate any suggestions you can provide.
>
> Best Regards
> Kelly Zhang


Subscribe

2019-12-01 Thread CharSyam



Spark 2.4.5 release?

2019-12-01 Thread jm
Hi all,

Is there any desire to prepare a 2.4.5 release? It's been 3 months since
2.4.4 was released, and there have been quite a few bug fixes since then (the
k8s client upgrade is the one I'm interested in, hence the question).

Cheers!
Jason.



--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



A question about rdd bytes size

2019-12-01 Thread zhangliyun
Hi:


I want to get the total size in bytes of a DataFrame with the following
function, but when I insert the DataFrame into a Hive table, the value
returned by the function differs from spark.sql.statistics.totalSize. The
spark.sql.statistics.totalSize is smaller than the result of the
getRDDBytes function below.


  import org.apache.spark.sql.DataFrame

  def getRDDBytes(df: DataFrame): Long = {
    df.rdd.getNumPartitions match {
      case 0 =>
        0L
      case numPartitions =>
        // Sum of the UTF-8 byte length of each row rendered as a string.
        val rddOfDataframe =
          df.rdd.map(_.toString().getBytes("UTF-8").length.toLong)
        val size = if (rddOfDataframe.isEmpty()) {
          0L
        } else {
          rddOfDataframe.reduce(_ + _)
        }
        size
    }
  }
I would appreciate any suggestions you can provide.
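
For reference, a hedged sketch of one way to inspect the catalog statistic
mentioned above after the insert (the table name is hypothetical; ANALYZE
TABLE is only needed if the statistic is stale):

  // Refresh the table-level statistics if needed.
  spark.sql("ANALYZE TABLE mydb.mytable COMPUTE STATISTICS")

  // DESCRIBE EXTENDED exposes a "Statistics" row backed by the
  // spark.sql.statistics.totalSize / numRows table properties.
  spark.sql("DESCRIBE EXTENDED mydb.mytable")
    .filter("col_name = 'Statistics'")
    .show(truncate = false)

  // The raw property can also be read directly.
  spark.sql("SHOW TBLPROPERTIES mydb.mytable ('spark.sql.statistics.totalSize')")
    .show(truncate = false)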


Best Regards
Kelly Zhang



[DISCUSS] Consistent relation resolution behavior in SparkSQL

2019-12-01 Thread Terry Kim
Hi all,

As discussed in SPARK-29900, Spark currently has two different relation
resolution behaviors:

   1. Look up temp view first, then table/persistent view
   2. Look up table/persistent view

The first behavior is used in SELECT, INSERT, and a few commands that
support temp views, such as DESCRIBE TABLE. The second behavior is used in
most other commands. Thus, it is hard to predict which relation resolution
rule will be applied for a given command.
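
To make the two behaviors concrete, here is a hedged sketch (names are
hypothetical; the per-command behavior follows the description above and is
not a guarantee for any particular command or version):

  spark.sql("CREATE TABLE t (id INT) USING parquet")   // persistent table `t`
  spark.range(3).createOrReplaceTempView("t")          // temp view also named `t`

  // Behavior 1 (temp view first): SELECT resolves `t` to the temp view.
  spark.sql("SELECT * FROM t").show()

  // Behavior 2 (table/persistent view only): many other commands skip temp
  // views and operate on the persistent table `t`, so the same name can
  // resolve differently depending on the command being run.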

I want to propose a consistent relation resolution behavior in which temp
views are always looked up first, before tables/persistent views, as
described in more detail in the "consistent relation resolution proposal"
doc.

Note that this proposal is a breaking change, but the impact should be
minimal since this applies only when there are temp views and tables with
the same name.

Any feedback will be appreciated.

I also want to thank Wenchen Fan, Ryan Blue, Burak Yavuz, and Dongjoon Hyun
for their guidance and suggestions.

Regards,
Terry





Status of Scala 2.13 support

2019-12-01 Thread Sean Owen
As you can see, I've been working on Scala 2.13 support. The umbrella
is https://issues.apache.org/jira/browse/SPARK-25075. I wanted to lay
out the status and strategy.

This will not be done for 3.0. At the least, there are a few key
dependencies (Chill, Kafka) that aren't published for 2.13, and at
least one change that will require removing an API deprecated as of 3.0.
Realistically: maybe Spark 3.1. I don't yet think it's pressing.


Making the change is difficult, as it's hard to understand the extent
of the necessary changes until the whole thing minimally compiles for
2.13. I have gotten essentially that far in a local clone. The good
news is I don't see any obvious hard blockers, but the changes add up
to thousands of lines in 200+ files.


What do we need to do for 3.0? Ideally, any changes that entail breaking a
public API. The biggest issue there is that extensive changes to the Scala
collection hierarchy mean that many public APIs that return a Seq, Map,
TraversableOnce, etc. _will_ actually change types in 2.13 (they become
immutable). See:
https://issues.apache.org/jira/browse/SPARK-27683 and
https://issues.apache.org/jira/browse/SPARK-29292 as the main
examples.

In both cases, keeping the exact same public type would require much
bigger changes. These are the kind of changes that all applications
face when migrating to 2.13, though; 2.12 and 2.13 apps were never
meant to be binary-compatible. So, in both cases we're not changing
these, to avoid a lot of change and parallel source trees.

I _think_ we're done with any other must-do changes for 3.0, therefore.


What _can_ we do for 3.0? Small changes that don't affect the 2.12
build are OK, and that's what you see in the pull requests going in at the
moment. The big question is whether we want to do the large change for
https://issues.apache.org/jira/browse/SPARK-29292 before 3.0. It will
mean adding a ton of ".toSeq" and ".toMap" calls to make mutable
collections immutable when passed to methods. In theory, it won't
affect behavior. We'll have to see if it does in practice.
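
As a hedged illustration of the kind of source change this involves (generic
Scala, not actual Spark code):

  import scala.collection.mutable.ArrayBuffer

  // In 2.12, scala.Seq aliases scala.collection.Seq, so a mutable ArrayBuffer
  // can be passed directly where Seq is expected. In 2.13, scala.Seq is
  // scala.collection.immutable.Seq, so an explicit conversion is required;
  // appending .toSeq compiles on both versions.
  def publicApi(values: Seq[Int]): Int = values.sum

  val buf = ArrayBuffer(1, 2, 3)
  publicApi(buf.toSeq)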

The rest will have to wait until after 3.0, I believe, including even
testing the 2.13 build, which will probably turn up some more issues.


Thoughts on approach?

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] PostgreSQL dialect

2019-12-01 Thread Driesprong, Fokko
+1 (non-binding)

Cheers, Fokko

Op do 28 nov. 2019 om 03:47 schreef Dongjoon Hyun :

> +1
>
> Bests,
> Dongjoon.
>
> On Tue, Nov 26, 2019 at 3:52 PM Takeshi Yamamuro 
> wrote:
>
>> Yea, +1, that looks pretty reasonable to me.
>> > Here I'm proposing to hold off the PostgreSQL dialect. Let's remove it
>> > from the codebase before it's too late. Currently we only have 3 features
>> > under the PostgreSQL dialect:
>> I personally think we could at least stop work on the dialect until
>> 3.0 is released.
>>
>>
>> On Wed, Nov 27, 2019 at 2:41 AM Gengliang Wang <
>> gengliang.w...@databricks.com> wrote:
>>
>>> +1 with the practical proposal.
>>> To me, the major concern is that the code base becomes complicated,
>>> while the PostgreSQL dialect has very limited features. I tried introducing
>>> one big flag `spark.sql.dialect` and isolating the related code in #25697,
>>> but it seems hard to keep it clean.
>>> Furthermore, the PostgreSQL dialect configuration overlaps with the ANSI
>>> mode, which can be confusing sometimes.
>>>
>>> Gengliang
>>>
>>> On Tue, Nov 26, 2019 at 8:57 AM Xiao Li  wrote:
>>>
 +1


> One particular negative effect has been that new postgresql tests add
> well over an hour to tests,


 Adding postgresql tests is for improving the test coverage of Spark
 SQL. We should continue to do this by importing more test cases. The
 quality of Spark highly depends on the test coverage. We can further
 parallelize the test execution to reduce the test time.

 Migrating PostgreSQL workloads to Spark SQL


 This should not be our current focus. In the near future, it is
 impossible to be fully compatible with PostgreSQL. We should focus on
 adding features that are useful to Spark community. PostgreSQL is a good
 reference, but we do not need to blindly follow it. We already closed
 multiple related JIRAs that try to add some PostgreSQL features that are
 not commonly used.

 Cheers,

 Xiao


 On Tue, Nov 26, 2019 at 8:30 AM Maciej Szymkiewicz <
 mszymkiew...@gmail.com> wrote:

> I think it is important to distinguish between two different concepts:
>
> - Adherence to standards and their well established implementations.
> - Enabling migrations from some product X to Spark.
>
> While these two problems are related, they are independent and one can be
> achieved without the other.
>
> - The former approach doesn't imply that all features of the SQL standard
>   (or of a specific implementation) are provided. It is sufficient that the
>   commonly used features that are implemented are standard compliant.
>   Therefore, if an end user applies some well known pattern, things will
>   work as expected.
>
>   In my personal opinion that's something that is worth the required
>   development resources, and in general should happen within the project.
>
> - The latter one is more complicated. First of all, the premise that one
>   can "migrate PostgreSQL workloads to Spark" seems to be flawed. While
>   both Spark and PostgreSQL evolve, and probably have more in common today
>   than a few years ago, they're not even close enough to pretend that one
>   can be a replacement for the other. In contrast, existing compatibility
>   layers between major vendors make sense, because feature disparity (at
>   least when it comes to core functionality) is usually minimal. And that
>   doesn't even touch the problem that PostgreSQL provides extensively used
>   extension points that enable a broad and evolving ecosystem (what should
>   we do about continuous queries? Should Structured Streaming provide some
>   compatibility layer as well?).
>
>   More realistically, Spark could provide a compatibility layer with some
>   analytical tools that themselves provide some PostgreSQL compatibility,
>   but these are not always fully compatible with upstream PostgreSQL, nor
>   do they necessarily follow the latest PostgreSQL development.
>
>   Furthermore, a compatibility layer can be, within certain limits (i.e.
>   availability of the required primitives), maintained as a separate
>   project, without putting more strain on existing resources. Effectively
>   what we care about here is whether we can translate a certain SQL string
>   into a logical or physical plan.
>
>
> On 11/26/19 3:26 PM, Wenchen Fan wrote:
>
> Hi all,
>
> Recently we started an effort to achieve feature parity between Spark
> and PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>
> This goes very well. We've added many missing features (parser rules,
> built-in functions, etc.) to Spark, and also corrected several

Re: override collect_list

2019-12-01 Thread Driesprong, Fokko
Hi Abhnav,

This sounds to me like a bad design, since it isn't scalable. Would it be
possible to store all the data in a database like HBase/Bigtable/Cassandra?
This would allow you to write the data from all the workers in parallel to
the database.
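
For illustration only, a minimal sketch of the same idea using a plain
partitioned file sink instead of an external store (the path and column
names are hypothetical, with `df` being the grouped-by-id DataFrame from the
question below):

  import org.apache.spark.sql.functions.col

  // Each executor writes its own files in parallel; no group is ever
  // collected into a single in-memory list on one task.
  df.repartition(col("id"))        // optional: co-locate each id's rows
    .write
    .mode("overwrite")
    .partitionBy("id")             // one output directory per id, e.g. .../id=1/
    .parquet("/tmp/grouped_output")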

Cheers, Fokko

Op wo 27 nov. 2019 om 06:58 schreef Ranjan, Abhinav <
abhinav.ranjan...@gmail.com>:

> Hi all,
>
> I want to collect some rows into a list by using Spark's collect_list
> function.
>
> However, the number of rows going into the list is overflowing memory. Is
> there any way to force the collection of rows onto disk rather than into
> memory, or else, instead of collecting it as a single list, collect it as a
> list of lists so as to avoid holding the whole thing in memory?
>
> ex: df as:
>
>   id   col1   col2
>   1    as     sd
>   1    df     fg
>   1    gh     jk
>   2    rt     ty
>
> df.groupBy("id").agg(collect_list(struct("col1", "col2")).as("col3"))
>
>   id   col3
>   1    [(as,sd), (df,fg), (gh,jk)]
>   2    [(rt,ty)]
>
>
> So if id=1 has too many rows, the list will overflow. How can I avoid
> this scenario?
>
>
> Thanks,
>
> Abhnav
>
>
>