Re: performance of IN clause

2018-10-17 Thread Silvio Fiorito
Have you run explain for each query? If you look at the physical query plan 
it’s most likely the same. If the inner-query/join-table is small enough it 
should end up as a broadcast join.
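
A minimal PySpark sketch of that check, assuming an active SparkSession named `spark` and tables A and B registered with a join_key column (names taken from the question below); the semi-join/broadcast behaviour noted in the comments is the typical outcome, not a guarantee:

```python
# Compare the physical plans Spark produces for the IN-subquery form and the
# explicit join form of the same query.
in_df = spark.sql(
    "SELECT * FROM A WHERE join_key IN (SELECT join_key FROM B)")
join_df = spark.sql(
    "SELECT A.* FROM A INNER JOIN B ON A.join_key = B.join_key")

in_df.explain()    # the IN subquery is typically planned as a left semi join
join_df.explain()  # both plans show BroadcastHashJoin when B fits under the
                   # broadcast threshold (spark.sql.autoBroadcastJoinThreshold)
```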

From: Jayesh Lalwani 
Date: Wednesday, October 17, 2018 at 5:03 PM
To: "user@spark.apache.org" 
Subject: performance of IN clause

Is there a significant difference in how an IN clause performs when compared
to a JOIN?

Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million.

Will this query:
Select * from A where join_key in (Select join_key from B)
perform much worse than
Select * from A
INNER JOIN B on A.join_key = B.join_key

Will the first query always trigger a broadcast of B?





performance of IN clause

2018-10-17 Thread Jayesh Lalwani
Is there a significant difference in how an IN clause performs when
compared to a JOIN?

Let's say I have 2 tables, A and B. B has 50 million rows and A has 1 million.

Will this query:
Select * from A where join_key in (Select join_key from B)
perform much worse than
Select * from A
INNER JOIN B on A.join_key = B.join_key

Will the first query always trigger a broadcast of B?




[PySpark SQL]: SparkConf does not exist in the JVM

2018-10-17 Thread takao
Hi,

`pyspark.sql.SparkSession.builder.getOrCreate()` gives me an error, and I
wonder if anyone can help me with this.

The line of code that gives me an error is

```
with spark_session(master, app_name) as session:
```

where spark_session is a Python context manager:

```
@contextlib.contextmanager
def spark_session(master, app_name):
    session = pyspark.sql.SparkSession.builder\
        .master(master).appName(app_name)\
        .config("spark.executorEnv.PYTHONPATH", os.getenv("PYTHONPATH"))\
        .getOrCreate()
    try:
        yield session
    finally:
        session.stop()
```

The error message is

```
/usr/local/lib/python3.6/site-packages/pyspark/sql/session.py:170: in getOrCreate
    sparkConf = SparkConf()
/usr/local/lib/python3.6/site-packages/pyspark/conf.py:116: in __init__
    self._jconf = _jvm.SparkConf(loadDefaults)
py4j.protocol.Py4JError: SparkConf does not exist in the JVM
```

I am using Spark local cluster, and Spark's version is 2.3.2. PySpark is
also version 2.3.2.

Thanks in advance.
Takao
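
A quick check worth doing here (an assumption about the cause, not something confirmed in this thread): verify that the pyspark package Python imports and the Spark build the JVM is launched from are the same installation, since a mismatch between the two is a common source of Py4J "does not exist in the JVM" errors. A minimal sketch:

```python
# Sketch: compare the Python-side pyspark version with the Spark installation
# the JVM would be launched from. A mismatch (e.g. a pip-installed pyspark next
# to a different SPARK_HOME) is an assumption to rule out, not a confirmed
# diagnosis of this setup.
import os
import pyspark

print("pyspark package version:", pyspark.__version__)
print("SPARK_HOME:", os.environ.get("SPARK_HOME"))
# If SPARK_HOME points at a separate Spark build, its version can be checked
# from a shell with `$SPARK_HOME/bin/spark-submit --version`.
```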







Re: Spark In Memory Shuffle

2018-10-17 Thread ☼ R Nair
What are the steps to configure this? Thanks

On Wed, Oct 17, 2018, 9:39 AM onmstester onmstester
 wrote:

> Hi,
> I failed to config spark for in-memory shuffle so currently just
> using linux memory mapped directory (tmpfs) as working directory of spark,
> so everything is fast
>
> Sent using Zoho Mail 
>
>
> On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:
>
> Hi everyone,
>
>
> The possibility to have in memory shuffling is discussed in this issue
> https://github.com/apache/spark/pull/5403. It was in 2015.
>
> In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still
> shuffle using disks. I would like to know :
>
>
> What is the current state of in memory shuffling ?
>
> Is it implemented in production ?
>
> Does the current shuffle still use disks to work ?
>
> Is it possible to somehow do it in RAM only ?
>
>
> Regards,
>
> Thomas
>
>
>
>
>
>


Re: Spark In Memory Shuffle

2018-10-17 Thread Gourav Sengupta
super duper, I also need to try this out.

On Wed, Oct 17, 2018 at 2:39 PM onmstester onmstester
 wrote:

> Hi,
> I failed to config spark for in-memory shuffle so currently just
> using linux memory mapped directory (tmpfs) as working directory of spark,
> so everything is fast
>
> Sent using Zoho Mail 
>
>
> On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:
>
> Hi everyone,
>
>
> The possibility to have in memory shuffling is discussed in this issue
> https://github.com/apache/spark/pull/5403. It was in 2015.
>
> In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still
> shuffle using disks. I would like to know :
>
>
> What is the current state of in memory shuffling ?
>
> Is it implemented in production ?
>
> Does the current shuffle still use disks to work ?
>
> Is it possible to somehow do it in RAM only ?
>
>
> Regards,
>
> Thomas
>
>
>
>
>
>


Re: Spark In Memory Shuffle

2018-10-17 Thread onmstester onmstester
Hi,
I failed to config spark for in-memory shuffle so currently just using
linux memory mapped directory (tmpfs) as working directory of spark, so
everything is fast

Sent using Zoho Mail

On Wed, 17 Oct 2018 16:41:32 +0330, thomas lavocat wrote:

> Hi everyone,
>
> The possibility to have in memory shuffling is discussed in this issue
> https://github.com/apache/spark/pull/5403. It was in 2015.
>
> In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still
> shuffle using disks. I would like to know :
>
> What is the current state of in memory shuffling ?
> Is it implemented in production ?
> Does the current shuffle still use disks to work ?
> Is it possible to somehow do it in RAM only ?
>
> Regards,
> Thomas
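
For readers wondering what the tmpfs workaround described above looks like concretely, here is a minimal sketch; it assumes a tmpfs mount already exists at /mnt/spark-tmpfs (path and size are illustrative) and simply points spark.local.dir at it, so shuffle files still go through the normal on-disk path but land on RAM-backed storage:

```python
# Sketch: route Spark's shuffle/spill scratch space to a RAM-backed tmpfs mount.
# Assumes the mount was created beforehand, e.g.
#   sudo mount -t tmpfs -o size=32g tmpfs /mnt/spark-tmpfs
# Note: on YARN the node manager's local dirs take precedence over spark.local.dir.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tmpfs-shuffle-sketch")
    .config("spark.local.dir", "/mnt/spark-tmpfs")  # scratch dir for shuffle data
    .getOrCreate()
)
```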

FW: Pyspark: set Orc Stripe.size on dataframe writer issue

2018-10-17 Thread Somasundara, Ashwin
Hello Group

I am having issues setting the stripe size, index stride, and index on an ORC
file using PySpark. I am getting approximately 2,000 stripes for a 1.2 GB file
when I am expecting only 5 stripes with the 256 MB stripe-size setting.

Tried the below options

1. Set the options on the DataFrame writer. The compression setting in .option
worked, but no other .option setting did. I researched the .option method on the
DataFrameWriter class and it only covers compression, not the stripe, index,
and stride settings.

(df
    .repartition(custom field)
    .sortWithinPartitions(custom field, sort field 1, sort field 2)
    .write.format("orc")
    .option("compression", "zlib")           # only this option worked
    .option("preserveSortOrder", "true")
    .option("orc.stripe.size", "268435456")
    .option("orc.row.index.stride", "true")
    .option("orc.create.index", "true")
    .save(s3 location))


2. Created an empty Hive table with the above ORC settings and loaded it using
Spark's saveAsTable and insertInto methods. The resulting table had more stripes
than anticipated.

(df
    .repartition(custom field)
    .sortWithinPartitions(custom field, sort field 1, sort field 2)
    .write.format("orc")
    .mode("append")
    .saveAsTable(hive tablename))            # also tried .insertInto(hive table name)


For both options I have enabled the configs below:

spark.sql("set spark.sql.orc.impl=native")
spark.sql("set spark.sql.orc.enabled=true")
spark.sql("set spark.sql.orc.cache.stripe.details.size=" 268435456  ")

Please let me know if there is any missing piece of code, DataFrame-writer-level
method, or Spark-session-level config that would enable us to get the desired
results.
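
One avenue that may be worth trying (an assumption on my part, not something verified in this thread) is setting the ORC writer properties on the Hadoop configuration rather than through DataFrameWriter.option, since the ORC library reads its defaults from there. A minimal sketch, with placeholder column and path names:

```python
# Sketch: push ORC writer settings into the Hadoop configuration shared with
# the executors. Whether the native ORC writer (spark.sql.orc.impl=native)
# honours these for DataFrameWriter output is an assumption to verify, not a
# guarantee.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("orc.stripe.size", "268435456")    # target stripe size in bytes
hadoop_conf.set("orc.row.index.stride", "10000")   # stride is a row count, not a boolean
hadoop_conf.set("orc.create.index", "true")

(df
    .repartition("custom_field")                   # placeholder column name
    .write.format("orc")
    .option("compression", "zlib")
    .save("s3://bucket/path"))                     # placeholder output path
```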




Spark In Memory Shuffle

2018-10-17 Thread thomas lavocat

Hi everyone,


The possibility to have in memory shuffling is discussed in this issue 
https://github.com/apache/spark/pull/5403. It was in 2015.


In 2016 the paper "Scaling Spark on HPC Systems" says that Spark still
shuffles using disks. I would like to know:



What is the current state of in memory shuffling?

Is it implemented in production?

Does the current shuffle still use disks to work?

Is it possible to somehow do it in RAM only?


Regards,

Thomas





Re: SparkSQL read Hive transactional table

2018-10-17 Thread Gourav Sengupta
Hi,

I think that the speed of ORC has been improved in the latest versions. Any
chance you could use the latest version?

Regards,
Gourav Sengupta

On 17 Oct 2018 6:11 am, "daily"  wrote:

Hi,

Spark version: 2.3.0
Hive   version: 2.1.0

Best regards.


-- Original Message --
From: "Gourav Sengupta";
Sent: Tuesday, October 16, 2018, 6:35 PM
To: "daily";
Cc: "user"; "dev";
Subject: Re: SparkSQL read Hive transactional table

Hi,

can I please ask which version of Hive and Spark are you using?

Regards,
Gourav Sengupta

On Tue, Oct 16, 2018 at 2:42 AM daily  wrote:

> Hi,
>
> I use the HCatalog Streaming Mutation API to write data to a Hive transactional
> table, and then I use SparkSQL to read data from that table. I get the right
> result.
> However, SparkSQL takes more time to read the Hive ORC bucketed transactional
> table, because SparkSQL reads all columns (not just the columns involved in the
> SQL), so it uses more time.
> My question is: why does SparkSQL read all columns of the Hive ORC bucketed
> transactional table instead of only the columns involved in the SQL? Is it
> possible to make SparkSQL read only the columns involved in the SQL?
>
>
>
> For example:
> Hive Table:
> create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='true');
>
> create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6
> int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets
> stored as orc TBLPROPERTIES ('transactional'='false');
>
> SparkSQL:
> select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
> select sum(t1),sum(t2) from dbtest.t_a2 group by t0;
>
> SparkSQL's stage Input size:
>
> dbtest.t_a1=113.9 GB,
>
> dbtest.t_a2=96.5 MB
>
>
>
> Best regards.
>
>
>
>
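
As an editorial aside (not part of the original thread): one way to see whether column pruning is happening at all is to inspect the extended physical plan. For tables read through Spark's native ORC data source the scan node reports a ReadSchema containing only the referenced columns; Hive ACID tables read through the Hive path may not show this, which matches the behaviour asked about above. A sketch, using queries similar to the example in the thread:

```python
# Sketch: check which columns each scan actually requests. With the native ORC
# reader, the FileScan node in the extended plan lists a ReadSchema; if it
# contains every column of the table, no pruning is happening for that query.
spark.sql("set spark.sql.orc.impl=native")

spark.sql("select sum(t2), count(t1) from dbtest.t_a2 group by t0").explain(True)
spark.sql("select sum(t2), count(t1) from dbtest.t_a1 group by t0").explain(True)
```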