unsubscribe

2019-07-12 Thread Conor Begley
unsubscribe



Re: Problems running TPC-H on Raspberry Pi Cluster

2019-07-12 Thread agg212
Good to know. Will look into the Raspberry Pi 4 (w/4GB RAM). 

In general, are there any tuning or configuration tips/tricks for very
memory-constrained deployments (e.g., 1-4GB RAM)?
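For example, a minimal sketch of the kind of settings I have in mind for a
hypothetical 4GB worker (the values are purely illustrative guesses on my part,
not tested recommendations):

import org.apache.spark.sql.SparkSession

// Small executor heap with headroom left for the OS, fewer shuffle
// partitions, and automatic broadcast joins disabled to reduce OOM risk.
val spark = SparkSession.builder()
  .appName("tpch-low-mem")
  .config("spark.executor.memory", "2g")
  .config("spark.executor.memoryOverhead", "512m")
  .config("spark.sql.shuffle.partitions", "16")
  .config("spark.sql.autoBroadcastJoinThreshold", "-1")
  .getOrCreate()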






Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Reynold Xin
No, sorry, I'm not at liberty to share other people's code.

On Fri, Jul 12, 2019 at 9:33 AM, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> 
> Hi Reynold,
> 
> 
> I am genuinely curious about queries that are more than 1 MB, and am
> stunned by tens of MBs. Any samples to share? :)
> 
> 
> Regards,
> Gourav
> 
> On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin <r...@databricks.com> wrote:
> 
> 
>> There is no explicit limit but a JVM string cannot be bigger than 2G. It
>> will also at some point run out of memory with too big of a query plan
>> tree or become incredibly slow due to query planning complexity. I've seen
>> queries that are tens of MBs in size.
>> 
>> On Thu, Jul 11, 2019 at 5:01 AM, 李书明 <alemmont...@126.com> wrote:
>> 
>>> I have a question about the maximum length of a SQL statement that is
>>> supported in Spark SQL. I can't find the answer in the Spark documentation.
>>> 
>>> Maybe Integer.MAX_VALUE, or not?
>>> 
>> 
>> 
> 
>

Re: Help: What's the biggest length of SQL that's supported in SparkSQL?

2019-07-12 Thread Gourav Sengupta
Hi Reynold,

I am genuinely curious about queries that are more than 1 MB, and am
stunned by tens of MBs. Any samples to share? :)
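
(Purely as an illustration of the kind of generated statement I am imagining;
the table and column names below are made up:)

// A query built from a long list of literals grows linearly with the list,
// so a few hundred thousand IDs already yield multi-MB SQL text.
val ids = (1 to 200000).map(i => s"'customer_$i'")
val sql = s"SELECT * FROM events WHERE customer_id IN (${ids.mkString(", ")})"
println(sql.length)  // a few million characters, far below the ~2 GB JVM string cap
spark.sql(sql)       // parsing and planning a statement like this can already be slow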

Regards,
Gourav

On Thu, Jul 11, 2019 at 5:03 PM Reynold Xin  wrote:

> There is no explicit limit but a JVM string cannot be bigger than 2G. It
> will also at some point run out of memory with too big of a query plan tree
> or become incredibly slow due to query planning complexity. I've seen
> queries that are tens of MBs in size.
>
>
>
> On Thu, Jul 11, 2019 at 5:01 AM, 李书明  wrote:
>
>> I have a question about the maximum length of a SQL statement that is
>> supported in Spark SQL. I can't find the answer in the Spark documentation.
>>
>> Maybe Integer.MAX_VALUE, or not?
>>
>>
>


Partition pruning by IDs from another table

2019-07-12 Thread Tomas Bartalos
Hello,
I have two Parquet tables:
stored - a table of 10M records
data - a table of 100K records

*This is fast:*
val dataW = data.where("registration_ts in (20190516204l,
20190515143l,20190510125l, 20190503151l)")
dataW.count
res44: Long = 42
//takes 3 seconds
stored.join(broadcast(dataW), Seq("registration_ts"), "leftsemi").collect

*Similar, but it's slow:*
val dataW = data.limit(10).select("registration_ts").distinct
dataW.count
res45: Long = 1
//takes 2 minutes
stored.join(broadcast(dataW), Seq("registration_ts"), "leftsemi").collect
[Stage 181:>  (0 + 1) /
373]

The reason is that the first query pushes PartitionFilters down to the joined
"stored" table:
... PartitionFilters: [registration_ts#1635L IN
(20190516204,20190515143,20190510125,20190503151)
And the second one does not:
PartitionFilters: []

For a low number of IDs it is more effective to collect them to the driver and
issue a second query with a partition filter (sketched below), but there has to
be a better way...
How can I achieve effective partition pruning when using IDs from another
table?
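
(The workaround I mean, as a minimal sketch; assuming registration_ts is the
partition column, as the plans above suggest:)

import org.apache.spark.sql.functions.col

// Collect the handful of distinct IDs to the driver first...
val ids = dataW.select("registration_ts").distinct.collect().map(_.getLong(0))
// ...then filter with literal values, which Spark can push down as
// PartitionFilters on the partitioned "stored" table.
stored.where(col("registration_ts").isin(ids: _*)).collect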

The following SQL has the same query plan and the same behavior:
spark.sql("select * from stored where exists (select 1 from dataW where
dataW.registration_ts = stored.registration_ts)")

Thank you,
Tomas


Re: Release Apache Spark 2.4.4 before 3.0.0

2019-07-12 Thread Dongjoon Hyun
Thank you, Jacek.

BTW, I added `@private` since we need PMC's help to make an Apache Spark
release.

Can I get more feedback from the other PMC members?

Please let me know if you have any concerns (e.g., the release date or the
release manager).

As one of the community members, I am assuming the following (if we stay on
schedule).

- 2.4.4 at the end of July
- 2.3.4 at the end of August (since 2.3.0 was released at the end of
February 2018)
- 3.0.0 (possibly September?)
- 3.1.0 (January 2020?)

Bests,
Dongjoon.


On Thu, Jul 11, 2019 at 1:30 PM Jacek Laskowski  wrote:

> Hi,
>
> Thanks Dongjoon Hyun for stepping up as a release manager!
> Much appreciated.
>
> If there's a volunteer to cut a release, I'm always happy to support it.
>
> In addition, the more frequent the releases, the better for end users, since
> they have a choice to upgrade and get all the latest fixes, or to wait. It's
> their call, not ours (otherwise we'd keep them waiting).
>
> My big 2 yes'es for the release!
>
> Jacek
>
>
> On Tue, 9 Jul 2019, 18:15 Dongjoon Hyun,  wrote:
>
>> Hi, All.
>>
>> Spark 2.4.3 was released two months ago (8th May).
>>
>> As of today (9th July), there exist 45 fixes in `branch-2.4` including
>> the following correctness or blocker issues.
>>
>> - SPARK-26038 Decimal toScalaBigInt/toJavaBigInteger not work for
>> decimals not fitting in long
>> - SPARK-26045 Error in the spark 2.4 release package with the
>> spark-avro_2.11 dependency
>> - SPARK-27798 from_avro can modify variables in other rows in local
>> mode
>> - SPARK-27907 HiveUDAF should return NULL in case of 0 rows
>> - SPARK-28157 Make SHS clear KVStore LogInfo for the blacklist entries
>> - SPARK-28308 CalendarInterval sub-second part should be padded
>> before parsing
>>
>> It would be great if we could have Spark 2.4.4 before we get busier
>> with 3.0.0.
>> If it's okay, I'd like to volunteer as the 2.4.4 release manager and roll
>> it next Monday (15th July).
>> What do you think?
>>
>> Bests,
>> Dongjoon.
>>
>