Re: [spark-30111] k8s builds broken for sparkR

2019-12-03 Thread Shane Knapp
https://github.com/apache/spark/pull/26753

ilan, thanks for taking care of this!

the k8s prb and master test are green now, and the jdk11 build failed
for an unrelated reason.

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/


On Tue, Dec 3, 2019 at 10:21 AM Shane Knapp  wrote:
>
> FYI, i just filed https://issues.apache.org/jira/browse/SPARK-30111
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu



--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Slower than usual on PRs

2019-12-03 Thread Hyukjin Kwon
Yeah, please take care of your health first!

On Tue, Dec 3, 2019 at 1:32 PM, Wenchen Fan wrote:

> Sorry to hear that. Hope you get better soon!
>
> On Tue, Dec 3, 2019 at 1:28 AM Holden Karau  wrote:
>
>> Hi Spark dev folks,
>>
>> Just an FYI I'm out dealing with recovering from a motorcycle accident so
>> my lack of (or slow) responses on PRs/docs is health related and please
>> don't block on any of my reviews. I'll do my best to find some OSS cycles
>> once I get back home.
>>
>> Cheers,
>>
>> Holden
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
Hmm, looks like something weird is going on, since it seems to be working
here and here. Perhaps it was temporarily broken and is now working again.

On Tue, Dec 3, 2019 at 8:35 PM Hyukjin Kwon  wrote:

> I think it's broken .. cc Josh Rosen
>
> On Wed, Dec 4, 2019 at 10:25 AM, Nicholas Chammas wrote:
>
>> We used to have a bot or something that automatically linked Jira tickets
>> to PRs that mentioned them in their title. I don't see that happening
>> anymore. 
>>
>> Did we intentionally remove this functionality, or is it temporarily
>> broken for some reason?
>>
>> Nick
>>
>>


Re: Auto-linking Jira tickets to their PRs

2019-12-03 Thread Hyukjin Kwon
I think it's broken .. cc Josh Rosen

On Wed, Dec 4, 2019 at 10:25 AM, Nicholas Chammas wrote:

> We used to have a bot or something that automatically linked Jira tickets
> to PRs that mentioned them in their title. I don't see that happening
> anymore. 
>
> Did we intentionally remove this functionality, or is it temporarily
> broken for some reason?
>
> Nick
>
>


Auto-linking Jira tickets to their PRs

2019-12-03 Thread Nicholas Chammas
We used to have a bot or something that automatically linked Jira tickets
to PRs that mentioned them in their title. I don't see that happening
anymore. 

Did we intentionally remove this functionality, or is it temporarily broken
for some reason?

Nick


[spark-30111] k8s builds broken for sparkR

2019-12-03 Thread Shane Knapp
FYI, i just filed https://issues.apache.org/jira/browse/SPARK-30111

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Support subdirectories when accessing partitioned Parquet Hive table

2019-12-03 Thread Lotkowski, Michael
Originally https://issues.apache.org/jira/browse/SPARK-30024


Hi all,

We have run into issues when trying to read a partitioned Parquet table
created by Hive. I think I have narrowed down the cause to how
InMemoryFileIndex creates a parent -> file mapping.

The folder structure created by Hive is as follows:

s3://bucket/table/date=2019-11-25/subdir1/data1.parquet
s3://bucket/table/date=2019-11-25/subdir2/data2.parquet

Looking through the code, it seems that InMemoryFileIndex creates a mapping
from each leaf directory to its child files, yielding the following:

val leafDirToChildrenFiles = Map(
  s3://bucket/table/date=2019-11-25/subdir1 ->
    s3://bucket/table/date=2019-11-25/subdir1/data1.parquet,
  s3://bucket/table/date=2019-11-25/subdir2 ->
    s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
)

This map is then used in PartitioningAwareFileIndex to prune the partitions.
From my understanding, pruning works by looking up the partition path in
leafDirToChildrenFiles, which in this case is
s3://bucket/table/date=2019-11-25, and therefore it fails to find any files for
this partition.
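To make the failure mode concrete, here is a minimal sketch in plain Scala
(no Spark dependency; the object name and the hard-coded paths are just
illustrative, not the actual Spark internals): the map is keyed by the
immediate parent directory, so an exact-key lookup by the partition path
comes back empty.

```scala
// Sketch of the lookup that pruning effectively performs. The leaf
// directories include the subdirectory level, but the key we look up is the
// partition directory itself, so nothing is found.
object PruneLookupSketch extends App {
  val leafDirToChildrenFiles: Map[String, Seq[String]] = Map(
    "s3://bucket/table/date=2019-11-25/subdir1" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir1/data1.parquet"),
    "s3://bucket/table/date=2019-11-25/subdir2" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir2/data2.parquet")
  )

  // Partition path as derived from the Hive metastore.
  val partitionPath = "s3://bucket/table/date=2019-11-25"

  // Exact-key lookup misses both entries, so the partition appears empty.
  val files = leafDirToChildrenFiles.getOrElse(partitionPath, Seq.empty)
  println(files.isEmpty) // true
}
```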

My suggested fix is to update how InMemoryFileIndex builds the mapping:
instead of mapping each parent directory to its files, map the rootPath to the
files. More concretely:
https://gist.github.com/lotkowskim/76e8ff265493efd0b2b7175446805a82
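A rough plain-Scala sketch of the idea (not the actual patch; the object
name, paths, and the hard-coded partition root are hypothetical): group every
leaf file under its partition root instead of its immediate parent, so the
lookup by partition path succeeds even when files sit in subdirectories.

```scala
// Sketch of keying the mapping by partition root rather than immediate
// parent directory.
object RootPathMappingSketch extends App {
  val files = Seq(
    "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
    "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
  )

  // Assume the partition directory is known in advance (hard-coded here for
  // the sketch). Any file below it is attributed to the root itself,
  // regardless of intermediate subdirectories.
  val partitionRoot = "s3://bucket/table/date=2019-11-25"
  val rootToFiles: Map[String, Seq[String]] = files.groupBy { f =>
    if (f.startsWith(partitionRoot + "/")) partitionRoot
    else f.substring(0, f.lastIndexOf('/'))
  }

  // Now the partition-path lookup finds both files.
  println(rootToFiles(partitionRoot).size) // 2
}
```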

I have tested this by updating the jar running on EMR, and we can now
correctly read the data from these partitioned tables. It's also worth noting
that we can read the data, without any modifications to the code, if we use
the following settings:

"spark.sql.hive.convertMetastoreParquet" to "false",
"spark.hive.mapred.supports.subdirectories" to "true",
"spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"

However, with these settings we lose the ability to prune partitions, causing
us to read the entire table every time, as we aren't using a Spark relation.
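For reference, the three workaround settings above would be applied when
building a session roughly like this (illustrative config fragment only: it
needs a Spark runtime with Hive support on the classpath, and the app name is
made up):

```scala
import org.apache.spark.sql.SparkSession

// Workaround configuration: read subdirectories via the Hive path instead of
// Spark's native Parquet relation (at the cost of partition pruning).
val spark = SparkSession.builder()
  .appName("read-partitioned-table")
  .enableHiveSupport()
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.hive.mapred.supports.subdirectories", "true")
  .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
  .getOrCreate()
```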

I want to start a discussion on whether this is the correct change, or whether
we are missing something more obvious. In either case, I would be happy to
fully implement the change.

Thanks,

Michael




Amazon Development Centre (Scotland) Limited registered office: Waverley Gate, 
2-4 Waterloo Place, Edinburgh EH1 3EG, Scotland. Registered in Scotland 
Registration number SC26867