[jira] [Commented] (SPARK-16188) Spark sql create a lot of small files

2017-08-17 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131796#comment-16131796
 ] 

cen yuhai commented on SPARK-16188:
---

[~xianlongZhang] yes, you are right. I have implemented this feature by adding a 
repartition by rand() at the analysis phase.
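
At the user level, the equivalent of what such a rule injects looks roughly like 
this (a minimal sketch with the Spark 2.x Dataset API; the session name, partition 
count, and table names are illustrative):
{code}
import org.apache.spark.sql.functions.rand

// Explicitly shuffle the query result into a fixed number of partitions before
// writing, so the number of output files is bounded regardless of the shuffle
// parallelism that produced the result. All names below are illustrative.
val result = spark.sql("SELECT * FROM source_table")
result
  .repartition(16, rand())        // 16 output files; rand() spreads rows evenly
  .write
  .insertInto("target_table")
{code}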

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>
> I find that Spark SQL will create as many files as there are partitions. When 
> the results are small, there will be too many small files and most of them are 
> empty. 
> Hive has a feature that checks the average file size: if the average file size 
> is smaller than "hive.merge.smallfiles.avgsize", Hive will add a job to merge 
> the files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2017-08-17 Thread Sergey Serebryakov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131794#comment-16131794
 ] 

Sergey Serebryakov commented on SPARK-21782:


Your understanding is correct. Reusing the same {{Random}} instance multiple 
times (not really an option, as the shuffle is parallel), using a better RNG, or 
substantially scrambling the seed (hashing?) would all help.
Changing the "smearing" algorithm would also work, e.g. to something like this:
{code}
val distributePartition = (index: Int, items: Iterator[T]) => {
  val rng = new Random(index)
  items.map { t => (rng.nextInt(numPartitions), t) }
} : Iterator[(Int, T)]
{code}
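
And here is a sketch of the seed-scrambling option, keeping the existing smearing 
but hashing the index first ({{byteswap32}} is just one possible scrambler):
{code}
// Hash the partition index before seeding, so nearby indices no longer map to
// nearly identical starting positions.
import scala.util.hashing

val distributePartition = (index: Int, items: Iterator[T]) => {
  var position = new Random(hashing.byteswap32(index)).nextInt(numPartitions)
  items.map { t =>
    position = position + 1
    (position % numPartitions, t)
  }
} : Iterator[(Int, T)]
{code}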

Please let me know which way you'd like to see it.

> Repartition creates skews when numPartitions is a power of 2
> 
>
> Key: SPARK-21782
> URL: https://issues.apache.org/jira/browse/SPARK-21782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sergey Serebryakov
>  Labels: repartition
> Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png
>
>
> *Problem:*
> When an RDD (particularly with a low item-per-partition ratio) is 
> repartitioned to {{numPartitions}} = power of 2, the resulting partitions are 
> very uneven-sized. This affects both {{repartition()}} and 
> {{coalesce(shuffle=true)}}.
> *Steps to reproduce:*
> {code}
> $ spark-shell
> scala> sc.parallelize(0 until 1000, 
> 250).repartition(64).glom().map(_.length).collect()
> res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
> {code}
> *Explanation:*
> Currently, the [algorithm for 
> repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
>  (shuffle-enabled coalesce) is as follows:
> - for each initial partition {{index}}, generate {{position}} as {{(new 
> Random(index)).nextInt(numPartitions)}}
> - then, for element number {{k}} in initial partition {{index}}, put it in 
> the new partition {{position + k}} (modulo {{numPartitions}}).
> So, essentially elements are smeared roughly equally over {{numPartitions}} 
> buckets - starting from the one with number {{position+1}}.
> Note that a new instance of {{Random}} is created for every initial partition 
> {{index}}, with a fixed seed {{index}}, and then discarded. So the 
> {{position}} is deterministic for every {{index}} for any RDD in the world. 
> Also, [{{nextInt(bound)}} 
> implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
>  has a special case when {{bound}} is a power of 2, which is basically taking 
> several highest bits from the initial seed, with only a minimal scrambling.
> Due to deterministic seed, using the generator only once, and lack of 
> scrambling, the {{position}} values for power-of-two {{numPartitions}} always 
> end up being almost the same regardless of the {{index}}, causing some 
> buckets to be much more popular than others. So, {{repartition}} will in fact 
> intentionally produce skewed partitions even when the partitions were 
> roughly equal in size beforehand.
> The behavior seems to have been introduced in SPARK-1770 by 
> https://github.com/apache/spark/pull/727/
> {quote}
> The load balancing is not perfect: a given output partition
> can have up to N more elements than the average if there are N input
> partitions. However, some randomization is used to minimize the
> probability that this happens.
> {quote}
> Another related ticket: SPARK-17817 - 
> https://github.com/apache/spark/pull/15445



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21771) SparkSQLEnv creates a useless meta hive client

2017-08-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21771:
--
Issue Type: Improvement  (was: Bug)

> SparkSQLEnv creates a useless meta hive client
> --
>
> Key: SPARK-21771
> URL: https://issues.apache.org/jira/browse/SPARK-21771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kent Yao
>Priority: Minor
>
> Once a meta Hive client is created, it generates its SessionState, which 
> creates a lot of session-related directories, some marked deleteOnExit and 
> some not. If a Hive client is not needed, we should not create it at the very 
> start.
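
A minimal sketch of the idea (illustrative only; it assumes the existing 
{{HiveUtils.newClientForMetadata}} factory and is not the actual patch):
{code}
// Defer creation of the metadata Hive client until something actually needs it,
// instead of building it eagerly while SparkSQLEnv initializes.
// sparkConf and hadoopConf are whatever configs are already in scope there.
lazy val metaHiveClient: HiveClient =
  HiveUtils.newClientForMetadata(sparkConf, hadoopConf)
{code}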



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2017-08-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131784#comment-16131784
 ] 

Sean Owen commented on SPARK-21782:
---

Is the problem summary just: with a power of 2 bound, similar seeds give 
similar output?
Isn't that solved with a better RNG or just simple scrambling of the seed bits?
The seed is actually essential.

> Repartition creates skews when numPartitions is a power of 2
> 
>
> Key: SPARK-21782
> URL: https://issues.apache.org/jira/browse/SPARK-21782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sergey Serebryakov
>  Labels: repartition
> Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png
>
>
> *Problem:*
> When an RDD (particularly with a low item-per-partition ratio) is 
> repartitioned to {{numPartitions}} = power of 2, the resulting partitions are 
> very uneven-sized. This affects both {{repartition()}} and 
> {{coalesce(shuffle=true)}}.
> *Steps to reproduce:*
> {code}
> $ spark-shell
> scala> sc.parallelize(0 until 1000, 
> 250).repartition(64).glom().map(_.length).collect()
> res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
> {code}
> *Explanation:*
> Currently, the [algorithm for 
> repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
>  (shuffle-enabled coalesce) is as follows:
> - for each initial partition {{index}}, generate {{position}} as {{(new 
> Random(index)).nextInt(numPartitions)}}
> - then, for element number {{k}} in initial partition {{index}}, put it in 
> the new partition {{position + k}} (modulo {{numPartitions}}).
> So, essentially elements are smeared roughly equally over {{numPartitions}} 
> buckets - starting from the one with number {{position+1}}.
> Note that a new instance of {{Random}} is created for every initial partition 
> {{index}}, with a fixed seed {{index}}, and then discarded. So the 
> {{position}} is deterministic for every {{index}} for any RDD in the world. 
> Also, [{{nextInt(bound)}} 
> implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
>  has a special case when {{bound}} is a power of 2, which is basically taking 
> several highest bits from the initial seed, with only a minimal scrambling.
> Due to deterministic seed, using the generator only once, and lack of 
> scrambling, the {{position}} values for power-of-two {{numPartitions}} always 
> end up being almost the same regardless of the {{index}}, causing some 
> buckets to be much more popular than others. So, {{repartition}} will in fact 
> intentionally produce skewed partitions even when the partitions were 
> roughly equal in size beforehand.
> The behavior seems to have been introduced in SPARK-1770 by 
> https://github.com/apache/spark/pull/727/
> {quote}
> The load balancing is not perfect: a given output partition
> can have up to N more elements than the average if there are N input
> partitions. However, some randomization is used to minimize the
> probability that this happens.
> {quote}
> Another related ticket: SPARK-17817 - 
> https://github.com/apache/spark/pull/15445



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-17 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131775#comment-16131775
 ] 

Sean Owen commented on SPARK-21770:
---

What's the current behavior for the prediction? I feel like we established this 
behavior on purpose. 

If an all-zero vector is the output, it should not be modified. The question that 
matters is what is predicted. 

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries
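
A self-contained sketch of the proposed behavior (illustrative; Spark's actual 
method works in place on an MLlib vector):
{code}
// If the raw prediction vector is all zeros, fall back to a uniform 1/n
// probability vector instead of dividing by a zero sum.
def normalizeToProbabilities(raw: Array[Double]): Array[Double] = {
  val sum = raw.sum
  if (sum == 0.0) Array.fill(raw.length)(1.0 / raw.length)
  else raw.map(_ / sum)
}

// normalizeToProbabilities(Array(0.0, 0.0, 0.0, 0.0)) == Array(0.25, 0.25, 0.25, 0.25)
{code}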



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16188) Spark sql create a lot of small files

2017-08-17 Thread xianlongZhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131772#comment-16131772
 ] 

xianlongZhang commented on SPARK-16188:
---

cen yuhai, thanks for your advice, but my company's data platform processes 
tens of thousands of SQL queries every day, so it is not practical to modify each 
query. I think adding a common mechanism to deal with this problem is the most 
suitable solution.

> Spark sql create a lot of small files
> -
>
> Key: SPARK-16188
> URL: https://issues.apache.org/jira/browse/SPARK-16188
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: spark 1.6.1
>Reporter: cen yuhai
>
> I find that Spark SQL will create as many files as there are partitions. When 
> the results are small, there will be too many small files and most of them are 
> empty. 
> Hive has a feature that checks the average file size: if the average file size 
> is smaller than "hive.merge.smallfiles.avgsize", Hive will add a job to merge 
> the files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21776.
---
Resolution: Invalid

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
>Priority: Trivial
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21776:
--
Priority: Trivial  (was: Blocker)

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
>Priority: Trivial
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2017-08-17 Thread Sergey Serebryakov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Serebryakov updated SPARK-21782:
---
Attachment: Screen Shot 2017-08-16 at 3.40.01 PM.png

Distribution of partition sizes (in bytes) spotted in the wild. Horizontal 
axis: partition index ({{0..1023}}). Vertical axis: partition size in bytes.

> Repartition creates skews when numPartitions is a power of 2
> 
>
> Key: SPARK-21782
> URL: https://issues.apache.org/jira/browse/SPARK-21782
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sergey Serebryakov
>  Labels: repartition
> Attachments: Screen Shot 2017-08-16 at 3.40.01 PM.png
>
>
> *Problem:*
> When an RDD (particularly with a low item-per-partition ratio) is 
> repartitioned to {{numPartitions}} = power of 2, the resulting partitions are 
> very uneven-sized. This affects both {{repartition()}} and 
> {{coalesce(shuffle=true)}}.
> *Steps to reproduce:*
> {code}
> $ spark-shell
> scala> sc.parallelize(0 until 1000, 
> 250).repartition(64).glom().map(_.length).collect()
> res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
> 0, 0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
> {code}
> *Explanation:*
> Currently, the [algorithm for 
> repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
>  (shuffle-enabled coalesce) is as follows:
> - for each initial partition {{index}}, generate {{position}} as {{(new 
> Random(index)).nextInt(numPartitions)}}
> - then, for element number {{k}} in initial partition {{index}}, put it in 
> the new partition {{position + k}} (modulo {{numPartitions}}).
> So, essentially elements are smeared roughly equally over {{numPartitions}} 
> buckets - starting from the one with number {{position+1}}.
> Note that a new instance of {{Random}} is created for every initial partition 
> {{index}}, with a fixed seed {{index}}, and then discarded. So the 
> {{position}} is deterministic for every {{index}} for any RDD in the world. 
> Also, [{{nextInt(bound)}} 
> implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
>  has a special case when {{bound}} is a power of 2, which is basically taking 
> several highest bits from the initial seed, with only a minimal scrambling.
> Due to deterministic seed, using the generator only once, and lack of 
> scrambling, the {{position}} values for power-of-two {{numPartitions}} always 
> end up being almost the same regardless of the {{index}}, causing some 
> buckets to be much more popular than others. So, {{repartition}} will in fact 
> intentionally produce skewed partitions even when the partitions were 
> roughly equal in size beforehand.
> The behavior seems to have been introduced in SPARK-1770 by 
> https://github.com/apache/spark/pull/727/
> {quote}
> The load balancing is not perfect: a given output partition
> can have up to N more elements than the average if there are N input
> partitions. However, some randomization is used to minimize the
> probability that this happens.
> {quote}
> Another related ticket: SPARK-17817 - 
> https://github.com/apache/spark/pull/15445



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21782) Repartition creates skews when numPartitions is a power of 2

2017-08-17 Thread Sergey Serebryakov (JIRA)
Sergey Serebryakov created SPARK-21782:
--

 Summary: Repartition creates skews when numPartitions is a power 
of 2
 Key: SPARK-21782
 URL: https://issues.apache.org/jira/browse/SPARK-21782
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Sergey Serebryakov


*Problem:*
When an RDD (particularly with a low item-per-partition ratio) is repartitioned 
to {{numPartitions}} = power of 2, the resulting partitions are very 
uneven-sized. This affects both {{repartition()}} and 
{{coalesce(shuffle=true)}}.

*Steps to reproduce:*

{code}
$ spark-shell

scala> sc.parallelize(0 until 1000, 
250).repartition(64).glom().map(_.length).collect()
res0: Array[Int] = Array(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 144, 250, 250, 250, 106, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
{code}

*Explanation:*
Currently, the [algorithm for 
repartition|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L450]
 (shuffle-enabled coalesce) is as follows:
- for each initial partition {{index}}, generate {{position}} as {{(new 
Random(index)).nextInt(numPartitions)}}
- then, for element number {{k}} in initial partition {{index}}, put it in the 
new partition {{position + k}} (modulo {{numPartitions}}).

So, essentially elements are smeared roughly equally over {{numPartitions}} 
buckets - starting from the one with number {{position+1}}.

Note that a new instance of {{Random}} is created for every initial partition 
{{index}}, with a fixed seed {{index}}, and then discarded. So the {{position}} 
is deterministic for every {{index}} for any RDD in the world. Also, 
[{{nextInt(bound)}} 
implementation|http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/util/Random.java/#393]
 has a special case when {{bound}} is a power of 2, which is basically taking 
several highest bits from the initial seed, with only a minimal scrambling.

Due to deterministic seed, using the generator only once, and lack of 
scrambling, the {{position}} values for power-of-two {{numPartitions}} always 
end up being almost the same regardless of the {{index}}, causing some buckets 
to be much more popular than others. So, {{repartition}} will in fact 
intentionally produce skewed partitions even when the partitions were roughly 
equal in size beforehand.
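
The clustering is easy to check interactively; the quick sketch below (not part of 
the reproduction above) just prints the starting positions the current algorithm 
picks:
{code}
import java.util.Random

// Starting positions chosen for initial partition indices 0..249 when
// numPartitions = 64, seeded exactly as the current algorithm does it:
val positions = (0 until 250).map(index => new Random(index).nextInt(64))
println(positions.distinct.sorted)
// Only a handful of tightly clustered values come out, matching the few
// non-empty buckets in the reproduction above.
{code}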

The behavior seems to have been introduced in SPARK-1770 by 
https://github.com/apache/spark/pull/727/
{quote}
The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.
{quote}

Another related ticket: SPARK-17817 - https://github.com/apache/spark/pull/15445



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21739) timestamp partition would fail in v2.2.0

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21739.
-
   Resolution: Fixed
 Assignee: Feng Zhu
Fix Version/s: 2.3.0
   2.2.1

> timestamp partition would fail in v2.2.0
> 
>
> Key: SPARK-21739
> URL: https://issues.apache.org/jira/browse/SPARK-21739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: wangzhihao
>Assignee: Feng Zhu
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>
> Spark v2.2.0 introduces TimeZoneAwareExpression, which causes bugs if we 
> select data from a table with timestamp partitions.
> The steps to reproduce it:
> {code:java}
> spark.sql("create table test (foo string) partitioned by (ts timestamp)")
> spark.sql("insert into table test partition(ts = 1) values('hi')")
> spark.table("test").show()
> {code}
> The root cause is that TableReader.scala#230 tries to cast the string to a 
> timestamp regardless of whether the timeZone exists.
> Here is the error stack trace
> {code}
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression$class.timeZone(datetimeExpressions.scala:46)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone$lzycompute(Cast.scala:172)
>   
>at 
> org.apache.spark.sql.catalyst.expressions.Cast.timeZone(Cast.scala:172)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1$$anonfun$apply$24.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToTimestamp$1.apply(Cast.scala:253)
>   at 
> org.apache.spark.sql.catalyst.expressions.Cast.nullSafeEval(Cast.scala:533)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:327)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:230)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$fillPartitionKeys$1$1.apply(TableReader.scala:228)
> {code}
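
A hedged sketch of the direction a fix could take (not the actual patch): give the 
partition-value cast an explicit time zone so {{TimeZoneAwareExpression.timeZone}} 
is defined when it is evaluated. The {{Cast(child, dataType, timeZoneId)}} 
constructor below is the Spark 2.2 signature; the literal value is illustrative.
{code}
import java.util.TimeZone
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// Evaluating the cast with a concrete timeZoneId avoids the None.get seen in the
// stack trace above when a partition value string is converted to a timestamp.
val timeZoneId = Option(TimeZone.getDefault.getID)
val partitionValue = Cast(Literal("2017-08-17 00:00:00"), TimestampType, timeZoneId).eval()
{code}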



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread zhaP524 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131749#comment-16131749
 ] 

zhaP524 edited comment on SPARK-21776 at 8/18/17 5:36 AM:
--

[~kiszk] I see, I have changed the issue type. This problem has caused my 
production work to be blocked. 


was (Author: 扎啤):
@Kazuaki Ishizaki I see, I have changed the issue type. This problem has caused 
my production work to be blocked. 

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
>Priority: Blocker
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread zhaP524 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131749#comment-16131749
 ] 

zhaP524 commented on SPARK-21776:
-

@Kazuaki Ishizaki I see, I have changed the issue type. This problem has caused 
my production work to be blocked. 

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
>Priority: Blocker
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread zhaP524 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaP524 updated SPARK-21776:

  Priority: Blocker  (was: Major)
Issue Type: Improvement  (was: Bug)

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
>Priority: Blocker
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21781) Modify DataSourceScanExec to use concrete ColumnVector type.

2017-08-17 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-21781:
-

 Summary: Modify DataSourceScanExec to use concrete ColumnVector 
type.
 Key: SPARK-21781
 URL: https://issues.apache.org/jira/browse/SPARK-21781
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takuya Ueshin


As mentioned at 
https://github.com/apache/spark/pull/18680#issuecomment-316820409, when we have 
more {{ColumnVector}} implementations, it might (or might not) have huge 
performance implications because it might disable inlining, or force virtual 
dispatches.

As for the read path, one of the major paths is the one generated by 
{{ColumnBatchScan}}. Currently it refers to {{ColumnVector}}, so the penalty will 
grow as we add more classes, but we can know the concrete type from its usage, 
e.g. the vectorized Parquet reader uses {{OnHeapColumnVector}}. We can use the 
concrete type in the generated code directly to avoid the penalty.
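
A rough illustration of the direction (the class name follows the description 
above; the snippet sketches the kind of Java the read path could emit and is not 
actual Spark codegen output):
{code}
// Referencing the concrete vector class in the generated Java lets the JIT inline
// the accessors instead of going through a virtual call on the abstract ColumnVector.
val vectorClz = "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"
val generated =
  s"""
     |$vectorClz col0 = ($vectorClz) batch.column(0);
     |int value0 = col0.getInt(rowIdx);
   """.stripMargin
{code}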



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21778) Simpler Dataset.sample API in Scala / Java

2017-08-17 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-21778:

Summary: Simpler Dataset.sample API in Scala / Java  (was: Simpler 
Dataset.sample API in Scala)

> Simpler Dataset.sample API in Scala / Java
> --
>
> Key: SPARK-21778
> URL: https://issues.apache.org/jira/browse/SPARK-21778
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> See parent ticket.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21779) Simpler Dataset.sample API in Python

2017-08-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21779:
---

 Summary: Simpler Dataset.sample API in Python
 Key: SPARK-21779
 URL: https://issues.apache.org/jira/browse/SPARK-21779
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


See parent ticket.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21780) Simpler Dataset.sample API in R

2017-08-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21780:
---

 Summary: Simpler Dataset.sample API in R
 Key: SPARK-21780
 URL: https://issues.apache.org/jira/browse/SPARK-21780
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


See parent ticket.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21778) Simpler Dataset.sample API in Scala

2017-08-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21778:
---

 Summary: Simpler Dataset.sample API in Scala
 Key: SPARK-21778
 URL: https://issues.apache.org/jira/browse/SPARK-21778
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


See parent ticket.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21777) Simpler Dataset.sample API

2017-08-17 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21777:
---

 Summary: Simpler Dataset.sample API
 Key: SPARK-21777
 URL: https://issues.apache.org/jira/browse/SPARK-21777
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin
Assignee: Reynold Xin


Dataset.sample requires a boolean flag withReplacement as the first argument. 
However, most of the time users simply want to sample some records without 
replacement. This ticket introduces a new sample function that simply takes in 
the fraction and seed.
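
For illustration, assuming {{ds}} is some {{Dataset[T]}} (the convenience 
signatures below are a sketch of the intent, not the final API):
{code}
// Today (Spark 2.2): the withReplacement flag must always be passed.
val s1 = ds.sample(false, 0.1, 42L)    // withReplacement, fraction, seed

// Proposed convenience overloads (sketch):
// val s2 = ds.sample(0.1)             // fraction only, without replacement
// val s3 = ds.sample(0.1, 42L)        // fraction and seed, without replacement
{code}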




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue URL:   (was: https://github.com/apache/spark/pull/18986)

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue URL: https://github.com/apache/spark/pull/18986

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
External issue ID:   (was: SPARK-21646)

> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

StanZhai updated SPARK-21774:
-
Description: 
Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result which is wrong:
{code}
++---+
|   a|  b|
++---+
|   0|  1|
|-0.1|  2|
++---+
{code}

Logical Plan:
{code}
== Parsed Logical Plan ==
'Project ['a]
+- 'Filter ('a = 0)
   +- 'UnresolvedRelation `src`

== Analyzed Logical Plan ==
a: string
Project [a#8528]
+- Filter (cast(a#8528 as int) = 0)
   +- SubqueryAlias src
  +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
 +- LocalRelation [_1#8525, _2#8526]
{code}

  was:
Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result which is wrong:
{code}
++---+
|   a|  b|
++---+
|   0|  1|
|-0.1|  2|
++---+
{code}


> The rule PromoteStrings cast string to a wrong data type
> 
>
> Key: SPARK-21774
> URL: https://issues.apache.org/jira/browse/SPARK-21774
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StanZhai
>Priority: Critical
>  Labels: correctness
>
> Data
> {code}
> create temporary view tb as select * from values
> ("0", 1),
> ("-0.1", 2),
> ("1", 3)
> as grouping(a, b)
> {code}
> SQL:
> {code}
> select a, b from tb where a=0
> {code}
> The result which is wrong:
> {code}
> ++---+
> |   a|  b|
> ++---+
> |   0|  1|
> |-0.1|  2|
> ++---+
> {code}
> Logical Plan:
> {code}
> == Parsed Logical Plan ==
> 'Project ['a]
> +- 'Filter ('a = 0)
>+- 'UnresolvedRelation `src`
> == Analyzed Logical Plan ==
> a: string
> Project [a#8528]
> +- Filter (cast(a#8528 as int) = 0)
>+- SubqueryAlias src
>   +- Project [_1#8525 AS a#8528, _2#8526 AS b#8529]
>  +- LocalRelation [_1#8525, _2#8526]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131681#comment-16131681
 ] 

Kazuaki Ishizaki commented on SPARK-21776:
--

Is this a question? If it is, it would be better to send a message to 
u...@spark.apache.org or d...@spark.apache.org.

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread zhaP524 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaP524 updated SPARK-21776:

Component/s: Spark Core
 Issue Type: Bug  (was: Question)

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Documentation, Input/Output, Spark Core
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark??

2017-08-17 Thread zhaP524 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaP524 updated SPARK-21776:

Description: 
In production we have to fully load an HBase table in Spark and join it with a 
dimension table to generate business data. Because the base table is loaded in 
full, memory pressure is very high, so I want to know whether Spark can handle 
this with memory-mapped files. Is there such a mechanism? How do you use it?
I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
very clear to me what this parameter is used for.
There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
code; are these the memory-mapped files mentioned above? How should I understand 
this?
Please let me know if anything is unclear.

Thank you!

  was:
In production we have to fully load an HBase table in Spark and join it with a 
dimension table to generate business data. Because the base table is loaded in 
full, memory pressure is very high, so I want to know whether Spark can handle 
this with memory-mapped files. Is there such a mechanism? How do you use it?
I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
very clear to me what this parameter is used for.
There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
code; are these the memory-mapped files mentioned above? How should I understand 
this??
Please let me know if anything is unclear.

Thank you!

Summary: How to use the memory-mapped file on Spark??  (was: How to use 
the memory-mapped file on Spark???)

> How to use the memory-mapped file on Spark??
> 
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, Documentation, Input/Output
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this?
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark???

2017-08-17 Thread zhaP524 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131672#comment-16131672
 ] 

zhaP524 commented on SPARK-21776:
-

!screenshot-2.png!

> How to use the memory-mapped file on Spark???
> -
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, Documentation, Input/Output
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this??
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21776) How to use the memory-mapped file on Spark???

2017-08-17 Thread zhaP524 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131671#comment-16131671
 ] 

zhaP524 commented on SPARK-21776:
-

!screenshot-1.png!

> How to use the memory-mapped file on Spark???
> -
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, Documentation, Input/Output
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this??
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark???

2017-08-17 Thread zhaP524 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaP524 updated SPARK-21776:

Attachment: screenshot-1.png

> How to use the memory-mapped file on Spark???
> -
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, Documentation, Input/Output
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this??
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21776) How to use the memory-mapped file on Spark???

2017-08-17 Thread zhaP524 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaP524 updated SPARK-21776:

Attachment: screenshot-2.png

> How to use the memory-mapped file on Spark???
> -
>
> Key: SPARK-21776
> URL: https://issues.apache.org/jira/browse/SPARK-21776
> Project: Spark
>  Issue Type: Question
>  Components: Block Manager, Documentation, Input/Output
>Affects Versions: 2.1.1
> Environment: Spark 2.1.1 
> Scala 2.11.8
>Reporter: zhaP524
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> In production we have to fully load an HBase table in Spark and join it with a 
> dimension table to generate business data. Because the base table is loaded in 
> full, memory pressure is very high, so I want to know whether Spark can handle 
> this with memory-mapped files. Is there such a mechanism? How do you use it?
> I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
> very clear to me what this parameter is used for.
> There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
> code; are these the memory-mapped files mentioned above? How should I 
> understand this??
> Please let me know if anything is unclear.
> Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21776) How to use the memory-mapped file on Spark???

2017-08-17 Thread zhaP524 (JIRA)
zhaP524 created SPARK-21776:
---

 Summary: How to use the memory-mapped file on Spark???
 Key: SPARK-21776
 URL: https://issues.apache.org/jira/browse/SPARK-21776
 Project: Spark
  Issue Type: Question
  Components: Block Manager, Documentation, Input/Output
Affects Versions: 2.1.1
 Environment: Spark 2.1.1 
Scala 2.11.8
Reporter: zhaP524


In production we have to fully load an HBase table in Spark and join it with a 
dimension table to generate business data. Because the base table is loaded in 
full, memory pressure is very high, so I want to know whether Spark can handle 
this with memory-mapped files. Is there such a mechanism? How do you use it?
I also found a Spark parameter, spark.storage.memoryMapThreshold=2m; it is not 
very clear to me what this parameter is used for.
There are putBytes and getBytes methods in DiskStore.scala in the Spark source 
code; are these the memory-mapped files mentioned above? How should I understand 
this??
Please let me know if anything is unclear.

Thank you!
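
For context on the {{spark.storage.memoryMapThreshold}} question, here is a 
simplified sketch of what {{DiskStore.getBytes}} does in Spark 2.1 (condensed and 
renamed for illustration; not the exact source):
{code}
import java.io.{File, RandomAccessFile}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Blocks smaller than the threshold are read into a regular heap buffer; blocks
// at or above the threshold are memory-mapped via FileChannel.map.
def readBlock(file: File, memoryMapThreshold: Long): ByteBuffer = {
  val channel = new RandomAccessFile(file, "r").getChannel
  try {
    if (file.length < memoryMapThreshold) {
      val buf = ByteBuffer.allocate(file.length.toInt)
      channel.read(buf, 0)   // a real implementation loops until the buffer is full
      buf.flip()
      buf
    } else {
      channel.map(MapMode.READ_ONLY, 0, file.length)
    }
  } finally {
    channel.close()
  }
}
{code}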



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5073) "spark.storage.memoryMapThreshold" has two default values

2017-08-17 Thread zhaP524 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131667#comment-16131667
 ] 

zhaP524 commented on SPARK-5073:


I wonder what this parameter is for? I also want to know whether this parameter 
is related to memory-mapped files. I hope you can explain; sorry for the trouble.

> "spark.storage.memoryMapThreshold" has two default values
> -
>
> Key: SPARK-5073
> URL: https://issues.apache.org/jira/browse/SPARK-5073
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Jianhui Yuan
>Priority: Minor
>
> In org.apache.spark.storage.DiskStore:
>  val minMemoryMapBytes = 
> blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 4096L)
> In org.apache.spark.network.util.TransportConf:
>  public int memoryMapBytes() {
>  return conf.getInt("spark.storage.memoryMapThreshold", 2 * 1024 * 
> 1024);
>  }
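
Until the two defaults are reconciled, one way to avoid the ambiguity is to 
set the property explicitly so that both code paths read the same value; a 
minimal sketch (the value itself is only an example):

{code}
import org.apache.spark.SparkConf

// pin the threshold explicitly; the value is in bytes here, and recent
// versions also accept size strings such as "2m"
val conf = new SparkConf()
  .set("spark.storage.memoryMapThreshold", (2 * 1024 * 1024).toString)
{code}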



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21775) Dynamic Log Level Settings for executors

2017-08-17 Thread LvDongrong (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

LvDongrong updated SPARK-21775:
---
Attachment: web.PNG
terminal.PNG

I changed the driver's log level to DEBUG and it took effect; the same works 
for the executors.

> Dynamic Log Level Settings for executors
> 
>
> Key: SPARK-21775
> URL: https://issues.apache.org/jira/browse/SPARK-21775
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.2.0
>Reporter: LvDongrong
> Attachments: terminal.PNG, web.PNG
>
>
> Sometimes we want to change an executor's log level after the application 
> has already been deployed, either to see more detailed information or to 
> reduce log volume. Changing the log4j configuration file is not convenient, 
> so we add the ability to set the log level of a running executor. The 
> setting can also be reverted, restoring the log level that was in effect 
> before the change.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21775) Dynamic Log Level Settings for executors

2017-08-17 Thread LvDongrong (JIRA)
LvDongrong created SPARK-21775:
--

 Summary: Dynamic Log Level Settings for executors
 Key: SPARK-21775
 URL: https://issues.apache.org/jira/browse/SPARK-21775
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Affects Versions: 2.2.0
Reporter: LvDongrong


Sometimes we want to change an executor's log level after the application has 
already been deployed, either to see more detailed information or to reduce 
log volume. Changing the log4j configuration file is not convenient, so we add 
the ability to set the log level of a running executor. The setting can also 
be reverted, restoring the log level that was in effect before the change.
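
For context, a minimal Scala sketch of the mechanism such a feature relies on 
(this is not the attached patch): flipping a log4j logger's level at runtime. 
Spark already exposes this for the driver via sc.setLogLevel; the proposal 
here is about pushing the same change to running executors.

{code}
import org.apache.log4j.{Level, LogManager}

// change a logger's level at runtime; no log4j.properties edit or restart needed
def setLogLevel(level: String, loggerName: Option[String] = None): Unit = {
  val logger = loggerName.map(name => LogManager.getLogger(name))
    .getOrElse(LogManager.getRootLogger)
  logger.setLevel(Level.toLevel(level))
}

setLogLevel("DEBUG")                           // root logger
setLogLevel("WARN", Some("org.apache.spark"))  // only Spark's own logging
{code}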



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21774) The rule PromoteStrings cast string to a wrong data type

2017-08-17 Thread StanZhai (JIRA)
StanZhai created SPARK-21774:


 Summary: The rule PromoteStrings cast string to a wrong data type
 Key: SPARK-21774
 URL: https://issues.apache.org/jira/browse/SPARK-21774
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: StanZhai
Priority: Critical


Data
{code}
create temporary view tb as select * from values
("0", 1),
("-0.1", 2),
("1", 3)
as grouping(a, b)
{code}

SQL:
{code}
select a, b from tb where a=0
{code}

The result which is wrong:
{code}
++---+
|   a|  b|
++---+
|   0|  1|
|-0.1|  2|
++---+
{code}
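
Until the promotion rule is fixed, a hedged workaround (assuming a spark-shell 
session where {{spark}} is the active SparkSession and the view {{tb}} above 
exists) is to make the intended numeric comparison explicit:

{code}
// cast explicitly instead of relying on implicit string promotion;
// only the row whose string parses to 0 should match
spark.sql("select a, b from tb where cast(a as double) = 0").show()
// +---+---+
// |  a|  b|
// +---+---+
// |  0|  1|
// +---+---+
{code}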



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21773) Should Install mkdocs if missing in the path in SQL documentation build

2017-08-17 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21773:


 Summary: Should Install mkdocs if missing in the path in SQL 
documentation build
 Key: SPARK-21773
 URL: https://issues.apache.org/jira/browse/SPARK-21773
 Project: Spark
  Issue Type: Improvement
  Components: Build, Documentation
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, the SQL documentation build on Jenkins appears to fail due to a 
missing {{mkdocs}}.

For example, see 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-docs/3580/console

{code}
...
Moving back into docs dir.
Moving to SQL directory and building docs.
Missing mkdocs in your path, skipping SQL documentation generation.
Moving back into docs dir.
Making directory api/sql
cp -r ../sql/site/. api/sql
jekyll 2.5.3 | Error:  unknown file type: ../sql/site/.
Deleting credential directory 
/home/jenkins/workspace/spark-master-docs/spark-utils/new-release-scripts/jenkins/jenkins-credentials-scUXuITy
Build step 'Execute shell' marked build as failure
[WS-CLEANUP] Deleting project workspace...[WS-CLEANUP] done
Finished: FAILURE
{code}

It looks like we had better install it when it is missing rather than fail the 
build.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21772) HiveException unable to move results from srcf to destf in InsertIntoHiveTable

2017-08-17 Thread liupengcheng (JIRA)
liupengcheng created SPARK-21772:


 Summary: HiveException unable to move results from srcf to destf 
in InsertIntoHiveTable 
 Key: SPARK-21772
 URL: https://issues.apache.org/jira/browse/SPARK-21772
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1
 Environment: JDK1.7
CentOS 6.3
Spark2.1
Reporter: liupengcheng


Currently, executing {code:java} create table as select {code} returns the 
following exception:

{code:java}
2017-08-17,16:14:18,792 ERROR 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Error 
executing query, currentState RUNNING,
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:346)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply$mcV$sp(HiveClientImpl.scala:770)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$loadTable$1.apply(HiveClientImpl.scala:770)
at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:316)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:262)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:261)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadTable(HiveClientImpl.scala:769)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply$mcV$sp(HiveExternalCatalog.scala:765)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763)
at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$loadTable$1.apply(HiveExternalCatalog.scala:763)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:763)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:323)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:170)
at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:347)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:119)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at 
org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:92)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:120)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:141)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:138)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:119)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.Dataset.(Dataset.scala:187)
at org.apache.spark.sql.Dataset$.ofR

[jira] [Created] (SPARK-21771) SparkSQLEnv creates a useless meta hive client

2017-08-17 Thread Kent Yao (JIRA)
Kent Yao created SPARK-21771:


 Summary: SparkSQLEnv creates a useless meta hive client
 Key: SPARK-21771
 URL: https://issues.apache.org/jira/browse/SPARK-21771
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Kent Yao
Priority: Minor


Once a meta Hive client is created, it builds its SessionState, which creates 
a lot of session-related directories, some marked deleteOnExit and some not. 
If the Hive client is not needed, we should not create it at startup.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-17 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21770:
---
Description: Given an n-element raw prediction vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector of all-equal 1/n entries  (was: Given a raw prediction 
vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector that predicts each class with equal probability (1 / 
numClasses).)

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-17 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-21770:
--

 Summary: ProbabilisticClassificationModel: Improve normalization 
of all-zero raw predictions
 Key: SPARK-21770
 URL: https://issues.apache.org/jira/browse/SPARK-21770
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Siddharth Murching
Priority: Minor


Given a raw prediction vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector that predicts each class with equal probability (1 / 
numClasses).
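
A hedged Scala sketch of the desired behaviour (not the actual MLlib method): 
normalize the raw scores to probabilities and fall back to a uniform 1/n 
distribution when every raw score is zero.

{code}
import org.apache.spark.ml.linalg.DenseVector

def normalizeToProbabilities(raw: DenseVector): DenseVector = {
  val sum = raw.values.sum
  val probs =
    if (sum == 0.0) Array.fill(raw.size)(1.0 / raw.size)  // all-zero case
    else raw.values.map(_ / sum)
  new DenseVector(probs)
}

// normalizeToProbabilities(new DenseVector(Array(0.0, 0.0, 0.0, 0.0)))
// -> [0.25, 0.25, 0.25, 0.25]
{code}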



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz edited comment on SPARK-21702 at 8/18/17 12:55 AM:
---

*Update:*

The data-bearing files (files that contain the data payload from the stream) 
written to S3 with "PartitionBy", when viewed through the AWS S3 GUI and 
selected via their left-hand check-box, report their encryption in the 
properties section as "-".

Data-bearing files written without "PartitionBy", viewed and selected the same 
way, report their encryption as "AES-256".

All related non-data-bearing files, whether or not "PartitionBy" was used, 
report their encryption as "AES-256" when selected the same way.

Clicking through the name of a single data-bearing file written with 
"PartitionBy" brings up a dedicated overview screen for the file, which 
reports it as having AES-256 encryption. This differs from the "-" shown for 
the same file in the parent listing when it is selected via its check-box.

As one can see, this labelling of encryption is inconsistent and can give the 
impression on first inspection that a file is unencrypted, while on deeper 
inspection via click-through the files report as encrypted.

I think this lowers the weight of this issue, and I can close it if it is 
deemed a non-issue; however, it would be good if all written files, whether 
data bearing or not, presented their encryption status consistently and 
correctly.

I must say I spun my wheels for a while believing the files had not been 
encrypted and trying to debug, until I stumbled upon what I have just 
described in this update.

Thanks for the advice about the s3a.impl field :)


was (Author: gpongracz):
*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

Data bearing files if written without using "PartitionBy", when viewed through 
the AWS S3 GUI and selected using their LHS check-box encryption in the 
properties section report their encryption as "AES-256".

All related non-data bearing files, irrespective whether "PartitionBy" has been 
or not been used when selected using their LHS check-box encryption in the 
properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, when 
"PartitionBy" has been used, brings up a dedicated overview screen for the 
file, reports it as having AES-256 encryption, which differs from how its 
reported with encryption "-" in the parent screen and selected using its LHS 
check-box.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I spun my wheels for a time believing I had not encrypted and trying 
to debug until I stumbled upon what I just described in this update.

Thanks for the advice about the s3a.impl field

> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.
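
For reference, a minimal sketch of the session configuration being discussed 
(bucket names and paths are placeholders):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3a-sse-structured-streaming")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
  .getOrCreate()

// the reported symptom only appears when the sink partitions its output, e.g.:
// df.writeStream
//   .format("parquet")
//   .partitionBy("date")  // partitioned data files list "-" in the S3 console
//   .option("checkpointLocation", "s3a://my-bucket/checkpoints/")
//   .start("s3a://my-bucket/output/")
{code}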



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz edited comment on SPARK-21702 at 8/18/17 12:55 AM:
---

*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

Data bearing files if written without using "PartitionBy", when viewed through 
the AWS S3 GUI and selected using their LHS check-box encryption in the 
properties section report their encryption as "AES-256".

All related non-data bearing files, irrespective whether "PartitionBy" has been 
or not been used when selected using their LHS check-box encryption in the 
properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, when 
"PartitionBy" has been used, brings up a dedicated overview screen for the 
file, reports it as having AES-256 encryption, which differs from how its 
reported with encryption "-" in the parent screen and selected using its LHS 
check-box.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I spun my wheels for a time believing I had not encrypted and trying 
to debug until I stumbled upon what I just described in this update.

Thanks for the advice about the s3a.impl field


was (Author: gpongracz):
*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

Data bearing files if written without using "PartitionBy", when viewed through 
the AWS S3 GUI and selected using their LHS check-box encryption in the 
properties section report their encryption as "AES-256".

All related non-data bearing files, irrespective whether "PartitionBy" has been 
or not been used when selected using their LHS check-box encryption in the 
properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, when 
"PartitionBy" has been used, brings up a dedicated overview screen for the 
file, reports it as having AES-256 encryption, which differs from how its 
reported with encryption "-" in the parent screen and selected using its LHS 
check-box.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I spun my wheels for a time believing I had not encrypted and trying 
to debug until I stumbled upon what I just described in this update.

> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz edited comment on SPARK-21702 at 8/18/17 12:54 AM:
---

*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

Data bearing files if written without using "PartitionBy", when viewed through 
the AWS S3 GUI and selected using their LHS check-box encryption in the 
properties section report their encryption as "AES-256".

All related non-data bearing files, irrespective whether "PartitionBy" has been 
or not been used when selected using their LHS check-box encryption in the 
properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, when 
"PartitionBy" has been used, brings up a dedicated overview screen for the 
file, reports it as having AES-256 encryption, which differs from how its 
reported with encryption "-" in the parent screen and selected using its LHS 
check-box.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I spun my wheels for a time believing I had not encrypted and trying 
to debug until I stumbled upon what I just described in this update.


was (Author: gpongracz):
*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

All related non-data bearing files when selected using their LHS check-box 
encryption in the properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, which brings up a 
dedicated overview screen for the file, reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I lost a bit of time believing I had not encrypted and tried to 
debug until I stumbled upon what I just described in this update.

Obviously only happening when PartitionBy is used.

> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz edited comment on SPARK-21702 at 8/18/17 12:47 AM:
---

*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

All related non-data bearing files when selected using their LHS check-box 
encryption in the properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, which brings up a 
dedicated overview screen for the file, reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

I must say I lost a bit of time believing I had not encrypted and tried to 
debug until I stumbled upon what I just described in this update.

Obviously only happening when PartitionBy is used.


was (Author: gpongracz):
*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

All related non-data bearing files when selected using their LHS check-box 
encryption in the properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, which brings up a 
dedicated overview screen for the file, reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

Obviously only happening when PartitionBy is used.

> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

George Pongracz updated SPARK-21702:

Summary: Structured Streaming S3A SSE Encryption Not Visible through AWS S3 
GUI when PartitionBy Used  (was: Structured Streaming S3A SSE Encryption Not 
Applied when PartitionBy Used)

> Structured Streaming S3A SSE Encryption Not Visible through AWS S3 GUI when 
> PartitionBy Used
> 
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz edited comment on SPARK-21702 at 8/18/17 12:43 AM:
---

*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section report their encryption as "-".

All related non-data bearing files when selected using their LHS check-box 
encryption in the properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, which brings up a 
dedicated overview screen for the file, reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection seems unencrypted, whilst really the 
files on deeper via click-through report as encrypted.

I think this lowers the weight of this issue and I can close if deemed a non 
issue, however it would be good if the files would written would all present 
consistently and correctly, whether data or non-data bearing. 

Obviously only happening when PartitionBy is used.


was (Author: gpongracz):
*Update:*

The data bearing files (files that contain the data payload from the stream) 
written to s3 when viewed through the AWS S3 GUI and selected using their LHS 
check-box encryption in the properties section as "-".

All related non-data bearing files when selected using their LHS check-box 
encryption in the properties section report their encryption  as "AES-256".

When clicking through the name of a single data bearing file, which brings up a 
dedicated overview screen for the file, reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can cause 
confusion that a file on first inspection is unencrypted. 

The good news is that the files are all encrypted underneath even if not 
appearing so at initial inspection though the AWS S3 GUI.

I think this lowers the priority of this iss and I can close if deemed a non 
issue - please advise. 

> Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
> -
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21702) Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used

2017-08-17 Thread George Pongracz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131537#comment-16131537
 ] 

George Pongracz commented on SPARK-21702:
-

*Update:*

The data-bearing files (files that contain the data payload from the stream) 
written to S3, when viewed through the AWS S3 GUI and selected via their 
left-hand check-box, report their encryption in the properties section as "-".

All related non-data-bearing files, selected the same way, report their 
encryption as "AES-256".

Clicking through the name of a single data-bearing file brings up a dedicated 
overview screen for the file, which reports it as having AES-256 encryption.

As one can see, this labelling of encryption is inconsistent and can give the 
impression on first inspection that a file is unencrypted.

The good news is that the files are all encrypted underneath, even if they do 
not appear so at initial inspection through the AWS S3 GUI.

I think this lowers the priority of this issue, and I can close it if it is 
deemed a non-issue - please advise.

> Structured Streaming S3A SSE Encryption Not Applied when PartitionBy Used
> -
>
> Key: SPARK-21702
> URL: https://issues.apache.org/jira/browse/SPARK-21702
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
> Environment: Hadoop 2.7.3: AWS SDK 1.7.4
> Hadoop 2.8.1: AWS SDK 1.10.6
>Reporter: George Pongracz
>Priority: Minor
>  Labels: security
>
> Settings:
>   .config("spark.hadoop.fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
>   .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", 
> "AES256")
> When writing to an S3 sink from structured streaming the files are being 
> encrypted using AES-256
> When introducing a "PartitionBy" the output data files are unencrypted. 
> All other supporting files, metadata are encrypted
> Suspect write to temp is encrypted and move/rename is not applying the SSE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21685) Params isSet in scala Transformer triggered by _setDefault in pyspark

2017-08-17 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131512#comment-16131512
 ] 

Bryan Cutler commented on SPARK-21685:
--

I believe the problem is that during the call to transform, the PySpark model 
does not differentiate between set and default params and sets them all on the 
Java side. I have submitted a fix at 
https://github.com/apache/spark/pull/18982, could you try it and see if it 
works for you?
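
To illustrate the set-versus-default distinction in a spark-shell (Binarizer 
is just a stand-in here, not the transformer from the report): a default value 
makes a param "defined" but not "set", and only explicitly set params should 
be copied from the Python wrapper to the JVM side.

{code}
import org.apache.spark.ml.feature.Binarizer

val b = new Binarizer().setInputCol("images")
b.params.filter(b.isSet(_)).map(_.name)       // Array(inputCol): explicitly set
b.params.filter(b.isDefined(_)).map(_.name)   // also includes threshold, a default
{code}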

> Params isSet in scala Transformer triggered by _setDefault in pyspark
> -
>
> Key: SPARK-21685
> URL: https://issues.apache.org/jira/browse/SPARK-21685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Ratan Rai Sur
>
> I'm trying to write a PySpark wrapper for a Transformer whose transform 
> method includes the line
> {code:java}
> require(!(isSet(outputNodeName) && isSet(outputNodeIndex)), "Can't set both 
> outputNodeName and outputNodeIndex")
> {code}
> This should only throw an exception when both of these parameters are 
> explicitly set.
> In the PySpark wrapper for the Transformer, there is this line in ___init___
> {code:java}
> self._setDefault(outputNodeIndex=0)
> {code}
> Here is the line in the main python script showing how it is being configured
> {code:java}
> cntkModel = 
> CNTKModel().setInputCol("images").setOutputCol("output").setModelLocation(spark,
>  model.uri).setOutputNodeName("z")
> {code}
> As you can see, only setOutputNodeName is being explicitly set but the 
> exception is still being thrown.
> If you need more context, 
> https://github.com/RatanRSur/mmlspark/tree/default-cntkmodel-output is the 
> branch with the code, the files I'm referring to here that are tracked are 
> the following:
> src/cntk-model/src/main/scala/CNTKModel.scala
> notebooks/tests/301 - CIFAR10 CNTK CNN Evaluation.ipynb
> The pyspark wrapper code is autogenerated



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery

2017-08-17 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131493#comment-16131493
 ] 

Liang-Chi Hsieh commented on SPARK-21759:
-

Submitted PR at https://github.com/apache/spark/pull/18968

> In.checkInputDataTypes should not wrongly report unresolved plans for IN 
> correlated subquery
> 
>
> Key: SPARK-21759
> URL: https://issues.apache.org/jira/browse/SPARK-21759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> With the check for structural integrity proposed in SPARK-21726, I found that 
> an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
> plans.
> For a correlated IN query like:
> {code}
> Project [a#0]
> +- Filter a#0 IN (list#4 [b#1])
>:  +- Project [c#2]
>: +- Filter (outer(b#1) < d#3)
>:+- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> After {{PullupCorrelatedPredicates}}, it produces query plan like:
> {code}
> 'Project [a#0]
> +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> Because the correlated predicate involves another attribute {{d#3}} in 
> subquery, it has been pulled out and added into the {{Project}} on the top of 
> the subquery.
> When {{list}} in {{In}} contains just one {{ListQuery}}, 
> {{In.checkInputDataTypes}} checks if the size of {{value}} expressions 
> matches the output size of subquery. In the above example, there is only 
> {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, 
> so it fails the check and {{In.resolved}} returns {{false}}.
> We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans 
> to fail the structural integrity check.
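
A hedged sketch of a query shape that produces such a plan (table and column 
names are stand-ins matching the attributes above, assuming a spark-shell 
session where {{spark}} and its implicits are in scope):

{code}
import spark.implicits._

val outer = Seq((1, 2)).toDF("a", "b")
val inner = Seq((1, 3)).toDF("c", "d")
outer.createOrReplaceTempView("t_outer")
inner.createOrReplaceTempView("t_inner")

// correlated IN subquery: the outer reference t_outer.b is what
// PullupCorrelatedPredicates pulls up during optimization
spark.sql(
  "select a from t_outer where a in (select c from t_inner where t_outer.b < t_inner.d)"
).show()
{code}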



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery

2017-08-17 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21759:

Description: 
With the check for structural integrity proposed in SPARK-21726, I found that 
an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
plans.

For a correlated IN query like:
{code}
Project [a#0]
+- Filter a#0 IN (list#4 [b#1])
   :  +- Project [c#2]
   : +- Filter (outer(b#1) < d#3)
   :+- LocalRelation , [c#2, d#3]
   +- LocalRelation , [a#0, b#1]
{code}

After {{PullupCorrelatedPredicates}}, it produces query plan like:
{code}
'Project [a#0]
+- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
   :  +- Project [c#2, d#3]
   : +- LocalRelation , [c#2, d#3]
   +- LocalRelation , [a#0, b#1]
{code}

Because the correlated predicate involves another attribute {{d#3}} in 
subquery, it has been pulled out and added into the {{Project}} on the top of 
the subquery.

When {{list}} in {{In}} contains just one {{ListQuery}}, 
{{In.checkInputDataTypes}} checks if the size of {{value}} expressions matches 
the output size of subquery. In the above example, there is only {{value}} 
expression and the subquery output has two attributes {{c#2, d#3}}, so it fails 
the check and {{In.resolved}} returns {{false}}.

We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans to 
fail the structural integrity check.



  was:
With the check for structural integrity proposed in SPARK-21726, I found that 
an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
plans.

For a correlated IN query like:
{code}
Project [a#0]
+- Filter a#0 IN (list#4 [b#1])
   :  +- Project [c#2]
   : +- Filter (outer(b#1) < d#3)
   :+- LocalRelation , [c#2, d#3]
   +- LocalRelation , [a#0, b#1]
{code}

After {{PullupCorrelatedPredicates}}, it produces query plan like:
{code}
'Project [a#0]
+- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
   :  +- Project [c#2, d#3]
   : +- LocalRelation , [c#2, d#3]
   +- LocalRelation , [a#0, b#1]
{code}

Because the correlated predicate involves another attribute {{d#3}} in 
subquery, it has been pulled out and added into the {{Project}} on the top of 
the subquery.

When {{list}} in {{In}} contains just one {{ListQuery}}, 
{{In.checkInputDataTypes}} checks if the size of {{value}} expressions matches 
the output size of subquery. In the above example, there is only {{value}} 
expression and the subquery output has two attributes {{c#2, d#3}}, so it fails 
the check and {{In.resolved}} returns {{false}}.

We should let {{PullupCorrelatedPredicates}} produce resolved plans to pass the 
structural integrity check.




> In.checkInputDataTypes should not wrongly report unresolved plans for IN 
> correlated subquery
> 
>
> Key: SPARK-21759
> URL: https://issues.apache.org/jira/browse/SPARK-21759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> With the check for structural integrity proposed in SPARK-21726, I found that 
> an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
> plans.
> For a correlated IN query like:
> {code}
> Project [a#0]
> +- Filter a#0 IN (list#4 [b#1])
>:  +- Project [c#2]
>: +- Filter (outer(b#1) < d#3)
>:+- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> After {{PullupCorrelatedPredicates}}, it produces query plan like:
> {code}
> 'Project [a#0]
> +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> Because the correlated predicate involves another attribute {{d#3}} in 
> subquery, it has been pulled out and added into the {{Project}} on the top of 
> the subquery.
> When {{list}} in {{In}} contains just one {{ListQuery}}, 
> {{In.checkInputDataTypes}} checks if the size of {{value}} expressions 
> matches the output size of subquery. In the above example, there is only 
> {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, 
> so it fails the check and {{In.resolved}} returns {{false}}.
> We should not let {{In.checkInputDataTypes}} wrongly report unresolved plans 
> to fail the structural integrity check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21759) In.checkInputDataTypes should not wrongly report unresolved plans for IN correlated subquery

2017-08-17 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-21759:

Summary: In.checkInputDataTypes should not wrongly report unresolved plans 
for IN correlated subquery  (was: PullupCorrelatedPredicates should not produce 
unresolved plan)

> In.checkInputDataTypes should not wrongly report unresolved plans for IN 
> correlated subquery
> 
>
> Key: SPARK-21759
> URL: https://issues.apache.org/jira/browse/SPARK-21759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> With the check for structural integrity proposed in SPARK-21726, I found that 
> an optimization rule {{PullupCorrelatedPredicates}} can produce unresolved 
> plans.
> For a correlated IN query like:
> {code}
> Project [a#0]
> +- Filter a#0 IN (list#4 [b#1])
>:  +- Project [c#2]
>: +- Filter (outer(b#1) < d#3)
>:+- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> After {{PullupCorrelatedPredicates}}, it produces query plan like:
> {code}
> 'Project [a#0]
> +- 'Filter a#0 IN (list#4 [(b#1 < d#3)])
>:  +- Project [c#2, d#3]
>: +- LocalRelation , [c#2, d#3]
>+- LocalRelation , [a#0, b#1]
> {code}
> Because the correlated predicate involves another attribute {{d#3}} in 
> subquery, it has been pulled out and added into the {{Project}} on the top of 
> the subquery.
> When {{list}} in {{In}} contains just one {{ListQuery}}, 
> {{In.checkInputDataTypes}} checks if the size of {{value}} expressions 
> matches the output size of subquery. In the above example, there is only 
> {{value}} expression and the subquery output has two attributes {{c#2, d#3}}, 
> so it fails the check and {{In.resolved}} returns {{false}}.
> We should let {{PullupCorrelatedPredicates}} produce resolved plans to pass 
> the structural integrity check.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21677.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jen-Ming Chung
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.
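
A hedged sketch (not the merged patch) of the kind of guard that avoids the 
NPE: evaluate foldable field-name expressions defensively so that a null 
literal such as {{trim(null)}} yields no field instead of throwing.

{code}
import org.apache.spark.sql.catalyst.expressions.Expression

// treat a foldable expression that evaluates to null (e.g. trim(null)) as
// "no field name", so the row can produce NULL instead of an exception
def foldableFieldName(expr: Expression): Option[String] =
  if (expr.foldable) Option(expr.eval()).map(_.toString) else None
{code}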



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21677) json_tuple throws NullPointException when column is null as string type.

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21677:
---

Assignee: Jen-Ming Chung

> json_tuple throws NullPointException when column is null as string type.
> 
>
> Key: SPARK-21677
> URL: https://issues.apache.org/jira/browse/SPARK-21677
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Jen-Ming Chung
>Priority: Minor
>  Labels: Starter
>
> I was testing {{json_tuple}} before using this to extract values from JSONs 
> in my testing cluster but I found it could actually throw  
> {{NullPointException}} as below sometimes:
> {code}
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Hyukjin'))").show()
> +---+
> | c0|
> +---+
> |224|
> +---+
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(' Jackson'))").show()
> ++
> |  c0|
> ++
> |null|
> ++
> scala> Seq(("""{"Hyukjin": 224, "John": 
> 1225}""")).toDS.selectExpr("json_tuple(value, trim(null))").show()
> ...
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:367)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$foldableFieldNames$1.apply(jsonExpressions.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames$lzycompute(jsonExpressions.scala:366)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.foldableFieldNames(jsonExpressions.scala:365)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields$lzycompute(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.constantFields(jsonExpressions.scala:373)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.org$apache$spark$sql$catalyst$expressions$JsonTuple$$parseRow(jsonExpressions.scala:417)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:401)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple$$anonfun$eval$4.apply(jsonExpressions.scala:400)
>   at org.apache.spark.util.Utils$.tryWithResource(Utils.scala:2559)
>   at 
> org.apache.spark.sql.catalyst.expressions.JsonTuple.eval(jsonExpressions.scala:400)
> {code}
> It sounds we should show explicit error messages or return {{NULL}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21767) Add Decimal Test For Avro in VersionSuite

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21767.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add Decimal Test For Avro in VersionSuite
> -
>
> Key: SPARK-21767
> URL: https://issues.apache.org/jira/browse/SPARK-21767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Decimal is a logical type in Avro. We need to ensure that support for Hive's 
> Avro serde works well in Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to control Spark always respecting schemas inferred by Spark SQL

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21769:

Summary: Add a table property for Hive-serde tables to control Spark always 
respecting schemas inferred by Spark SQL  (was: Add a table property for 
Hive-serde tables to controlling Spark always respecting schemas inferred by 
Spark SQL)

> Add a table property for Hive-serde tables to control Spark always respecting 
> schemas inferred by Spark SQL
> ---
>
> Key: SPARK-21769
> URL: https://issues.apache.org/jira/browse/SPARK-21769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> For Hive-serde tables, we always respect the schema stored in the Hive metastore, 
> because the schema could be altered by other engines that share the same 
> metastore. Thus, we always trust the metastore-controlled schema for 
> Hive-serde tables when the schemas differ (ignoring nullability and case). 
> However, in some scenarios, the Hive metastore can also INCORRECTLY overwrite 
> the schemas when the table's serde and the Hive metastore built-in serde are 
> different. 
> The proposed solution is to introduce a table property for such scenarios. 
> For a specific Hive-serde table, users can manually set this table property 
> to ask Spark to always respect the Spark-inferred schema instead of trusting 
> the metastore-controlled schema. By default, it is off. 






[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21769:

Issue Type: Improvement  (was: Bug)

> Add a table property for Hive-serde tables to controlling Spark always 
> respecting schemas inferred by Spark SQL
> ---
>
> Key: SPARK-21769
> URL: https://issues.apache.org/jira/browse/SPARK-21769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> For Hive-serde tables, we always respect the schema stored in the Hive metastore, 
> because the schema could be altered by other engines that share the same 
> metastore. Thus, we always trust the metastore-controlled schema for 
> Hive-serde tables when the schemas differ (ignoring nullability and case). 
> However, in some scenarios, the Hive metastore can also INCORRECTLY overwrite 
> the schemas when the table's serde and the Hive metastore built-in serde are 
> different. 
> The proposed solution is to introduce a table property for such scenarios. 
> For a specific Hive-serde table, users can manually set this table property 
> to ask Spark to always respect the Spark-inferred schema instead of trusting 
> the metastore-controlled schema. By default, it is off. 






[jira] [Created] (SPARK-21769) Add a table property for Hive-serde tables to controlling Spark always respecting schemas inferred by Spark SQL

2017-08-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21769:
---

 Summary: Add a table property for Hive-serde tables to controlling 
Spark always respecting schemas inferred by Spark SQL
 Key: SPARK-21769
 URL: https://issues.apache.org/jira/browse/SPARK-21769
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li


For Hive-serde tables, we always respect the schema stored in the Hive metastore, 
because the schema could be altered by other engines that share the same 
metastore. Thus, we always trust the metastore-controlled schema for Hive-serde 
tables when the schemas differ (ignoring nullability and case). However, in some 
scenarios, the Hive metastore can also INCORRECTLY overwrite the schemas when the 
table's serde and the Hive metastore built-in serde are different. 

The proposed solution is to introduce a table property for such scenarios. For 
a specific Hive-serde table, users can manually set this table property to ask 
Spark to always respect the Spark-inferred schema instead of trusting the 
metastore-controlled schema. By default, it is off. 
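As a rough illustration of how the opt-in could look from the user's side (the property name below is made up; this ticket does not define it):

{code}
// Hypothetical property name, for illustration only.
spark.sql(
  "ALTER TABLE my_hive_table " +
  "SET TBLPROPERTIES ('prefer.spark.inferred.schema' = 'true')")
{code}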







[jira] [Updated] (SPARK-21769) Add a table property for Hive-serde tables to make Spark always respect schemas inferred by Spark SQL

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21769:

Summary: Add a table property for Hive-serde tables to make Spark always 
respect schemas inferred by Spark SQL  (was: Add a table property for 
Hive-serde tables to control Spark always respecting schemas inferred by Spark 
SQL)

> Add a table property for Hive-serde tables to make Spark always respect 
> schemas inferred by Spark SQL
> -
>
> Key: SPARK-21769
> URL: https://issues.apache.org/jira/browse/SPARK-21769
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> For Hive-serde tables, we always respect the schema stored in Hive metastore, 
> because the schema could be altered by the other engines that share the same 
> metastore. Thus, we always trust the metastore-controlled schema for 
> Hive-serde tables when the schemas are different (without considering the 
> nullability and cases). However, in some scenarios, Hive metastore also could 
> INCORRECTLY overwrite the schemas when the serde and Hive metastore built-in 
> serde are different. 
> The proposed solution is to introduce a table property for such scenarios. 
> For a specific Hive-serde table, users can manually setting such table 
> property for asking Spark for always respect Spark-inferred schema instead of 
> trusting metastore-controlled schema. By default, it is off. 






[jira] [Resolved] (SPARK-16742) Kerberos support for Spark on Mesos

2017-08-17 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-16742.

   Resolution: Fixed
 Assignee: Arthur Rand
Fix Version/s: 2.3.0

[~arand] I don't see a specific bug tracking long-running app support (i.e. 
principal/keytab arguments to spark-submit), which I don't think this patch 
covered, so you may want to look at filing something to track that.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>Assignee: Arthur Rand
> Fix For: 2.3.0
>
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa






[jira] [Commented] (SPARK-19747) Consolidate code in ML aggregators

2017-08-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131340#comment-16131340
 ] 

Joseph K. Bradley commented on SPARK-19747:
---

Just saying: Thanks a lot for doing this reorg!  It's a nice step towards 
having pluggable algorithms.

> Consolidate code in ML aggregators
> --
>
> Key: SPARK-19747
> URL: https://issues.apache.org/jira/browse/SPARK-19747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Seth Hendrickson
>Priority: Minor
>
> Many algorithms in Spark ML are posed as optimization of a differentiable 
> loss function over a parameter vector. We implement these by having a loss 
> function accumulate the gradient using an Aggregator class which has methods 
> that amount to a {{seqOp}} and {{combOp}}. So, pretty much every algorithm 
> that obeys this form implements a cost function class and an aggregator 
> class, which are completely separate from one another but share probably 80% 
> of the same code. 
> I think it is important to clean things like this up, and if we can do it 
> properly it will make the code much more maintainable, readable, and bug 
> free. It will also help reduce the overhead of future implementations.
> The design is of course open for discussion, but I think we should aim to:
> 1. Have all aggregators share parent classes, so that they only need to 
> implement the {{add}} function. This is really the only difference in the 
> current aggregators.
> 2. Have a single, generic cost function that is parameterized by the 
> aggregator type. This reduces the many places we implement cost functions and 
> greatly reduces the amount of duplicated code.
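As a rough sketch of the proposed shape (all names below are illustrative, not existing Spark classes): a shared parent implements the generic {{combOp}}, loss and gradient bookkeeping, and each algorithm only supplies {{add}}; a single cost function parameterized by the aggregator type can then drive {{treeAggregate}} for every algorithm.

{code}
import org.apache.spark.ml.linalg.Vector

// Minimal stand-in for a training instance (label, weight, features).
case class TrainingPoint(label: Double, weight: Double, features: Vector)

// Illustrative parent class: concrete aggregators only implement `add` (the seqOp).
abstract class LossAggregatorSketch[Agg <: LossAggregatorSketch[Agg]] extends Serializable {
  self: Agg =>

  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected def gradientSumArray: Array[Double]

  /** Algorithm-specific piece: fold one instance into the running sums. */
  def add(point: TrainingPoint): Agg

  /** Generic combOp shared by every concrete aggregator. */
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    var i = 0
    while (i < gradientSumArray.length) {
      gradientSumArray(i) += other.gradientSumArray(i)
      i += 1
    }
    this
  }

  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSumArray.map(_ / weightSum)
}
{code}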






[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-08-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131315#comment-16131315
 ] 

Xiao Li commented on SPARK-4131:


https://github.com/apache/spark/pull/18975

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}






[jira] [Updated] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-4131:
---
Target Version/s: 2.3.0

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}






[jira] [Resolved] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2017-08-17 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-18394.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Executing the same query twice in a row results in CodeGenerator cache misses
> -
>
> Key: SPARK-18394
> URL: https://issues.apache.org/jira/browse/SPARK-18394
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
>Reporter: Jonny Serencsa
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> Executing the query:
> {noformat}
> select
> l_returnflag,
> l_linestatus,
> sum(l_quantity) as sum_qty,
> sum(l_extendedprice) as sum_base_price,
> sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> avg(l_quantity) as avg_qty,
> avg(l_extendedprice) as avg_price,
> avg(l_discount) as avg_disc,
> count(*) as count_order
> from
> lineitem_1_row
> where
> l_shipdate <= date_sub('1998-12-01', '90')
> group by
> l_returnflag,
> l_linestatus
> ;
> {noformat}
> twice (in succession), will result in CodeGenerator cache misses in BOTH 
> executions. Since the query is identical, I would expect the same code to be 
> generated. 
> Turns out, the generated code is not exactly the same, resulting in cache 
> misses when performing the lookup in the CodeGenerator cache. Yet, the code 
> is equivalent. 
> Below is (some portion of the) generated code for two runs of the query:
> run-1
> {noformat}
> import java.nio.ByteBuffer;
> import java.nio.ByteOrder;
> import scala.collection.Iterator;
> import org.apache.spark.sql.types.DataType;
> import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
> import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
> import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;
> public SpecificColumnarIterator generate(Object[] references) {
> return new SpecificColumnarIterator();
> }
> class SpecificColumnarIterator extends 
> org.apache.spark.sql.execution.columnar.ColumnarIterator {
> private ByteOrder nativeOrder = null;
> private byte[][] buffers = null;
> private UnsafeRow unsafeRow = new UnsafeRow(7);
> private BufferHolder bufferHolder = new BufferHolder(unsafeRow);
> private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7);
> private MutableUnsafeRow mutableRow = null;
> private int currentRow = 0;
> private int numRowsInBatch = 0;
> private scala.collection.Iterator input = null;
> private DataType[] columnTypes = null;
> private int[] columnIndexes = null;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor1;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor2;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor3;
> private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor 
> accessor4;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor5;
> private org.apache.spark.sql.execution.columnar.StringColumnAccessor 
> accessor6;
> public SpecificColumnarIterator() {
> this.nativeOrder = ByteOrder.nativeOrder();
> this.buffers = new byte[7][];
> this.mutableRow = new MutableUnsafeRow(rowWriter);
> }
> public void initialize(Iterator input, DataType[] columnTypes, int[] 
> columnIndexes) {
> this.input = input;
> this.columnTypes = columnTypes;
> this.columnIndexes = columnIndexes;
> }
> public boolean hasNext() {
> if (currentRow < numRowsInBatch) {
> return true;
> }
> if (!input.hasNext()) {
> return false;
> }
> org.apache.spark.sql.execution.columnar.CachedBatch batch = 
> (org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
> currentRow = 0;
> numRowsInBatch = batch.numRows();
> for (int i = 0; i < columnIndexes.length; i ++) {
> buffers[i] = batch.buffers()[columnIndexes[i]];
> }
> accessor = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder));
> accessor1 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder));
> accessor2 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder));
> accessor3 = new 
> org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[3]).order(nativeOrder));
> accessor4 = new 
> org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[4]).order(nativeOrder));
> accessor5 = ne

[jira] [Created] (SPARK-21768) spark.csv.read Empty String Parsed as NULL when nullValue is Set

2017-08-17 Thread Andrew Gross (JIRA)
Andrew Gross created SPARK-21768:


 Summary: spark.csv.read Empty String Parsed as NULL when nullValue 
is Set
 Key: SPARK-21768
 URL: https://issues.apache.org/jira/browse/SPARK-21768
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.2.0, 2.0.2
 Environment: AWS EMR Spark 2.2.0 (also Spark 2.0.2)
PySpark

Reporter: Andrew Gross


In a CSV with quoted fields, empty strings will be interpreted as NULL even 
when a nullValue is explicitly set:

Example CSV with Quoted Fields, Delimiter | and nullValue XXNULLXX

{{"XXNULLXX"|""|"XXNULLXX"|"foo"}}

PySpark Script to load the file (from S3):

{code:title=load.py|borderStyle=solid}
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("test_csv").getOrCreate()

fields = []
fields.append(StructField("First Null Field", StringType(), True))
fields.append(StructField("Empty String Field", StringType(), True))
fields.append(StructField("Second Null Field", StringType(), True))
fields.append(StructField("Non Empty String Field", StringType(), True))
schema = StructType(fields)

keys = ['s3://mybucket/test/demo.csv']

bad_data = spark.read.csv(keys, timestampFormat="yyyy-MM-dd HH:mm:ss", 
mode="FAILFAST", sep="|", nullValue="XXNULLXX", schema=schema)
bad_data.show()
{code}

Output
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|              null|             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}

Expected Output:
{noformat}
+----------------+------------------+-----------------+----------------------+
|First Null Field|Empty String Field|Second Null Field|Non Empty String Field|
+----------------+------------------+-----------------+----------------------+
|            null|                  |             null|                   foo|
+----------------+------------------+-----------------+----------------------+
{noformat}






[jira] [Commented] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible

2017-08-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131190#comment-16131190
 ] 

Steve Loughran commented on SPARK-21762:


SPARK-20703 simplifies this, especially testing, as it's isolated from 
FileFormatWriter. Same problem exists though: if you are getting any Create 
inconsistency, metrics probes trigger failures which may not be present by the 
time task commit actually takes place

> FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new 
> file isn't yet visible
> 
>
> Key: SPARK-21762
> URL: https://issues.apache.org/jira/browse/SPARK-21762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: object stores without complete creation consistency 
> (this includes AWS S3's caching of negative GET results)
>Reporter: Steve Loughran
>Priority: Minor
>
> The metrics collection of SPARK-20703 can trigger premature failure if the 
> newly written object isn't actually visible yet, that is if, after 
> {{writer.close()}}, a {{getFileStatus(path)}} call throws a 
> {{FileNotFoundException}}.
> Strictly speaking, not having a file immediately visible goes against the 
> fundamental expectations of the Hadoop FS APIs, namely fully consistent data & 
> metadata across all operations, with immediate global visibility of all 
> changes. However, not all object stores make that guarantee, be it for newly 
> created data or for updated blobs. And so spurious FNFEs can get raised, ones 
> which *should* have gone away by the time the actual task is committed. Or if 
> they haven't, the job is in deep trouble anyway.
> What to do?
> # Leave as is: fail fast and so catch blobstores/blobstore clients which don't 
> behave as required. One issue here: will that trigger retries, what happens 
> there, etc.?
> # Swallow the FNFE and hope the file is observable later.
> # Swallow all IOEs and hope that whatever problem the FS has is transient.
> Options 2 & 3 aren't going to collect metrics in the event of an FNFE, or at 
> least, not the counter of bytes written.






[jira] [Comment Edited] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible

2017-08-17 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131190#comment-16131190
 ] 

Steve Loughran edited comment on SPARK-21762 at 8/17/17 7:41 PM:
-

SPARK-21669 simplifies this, especially testing, as it's isolated from 
FileFormatWriter. Same problem exists though: if you are getting any Create 
inconsistency, metrics probes trigger failures which may not be present by the 
time task commit actually takes place


was (Author: ste...@apache.org):
SPARK-20703 simplifies this, especially testing, as it's isolated from 
FileFormatWriter. Same problem exists though: if you are getting any Create 
inconsistency, metrics probes trigger failures which may not be present by the 
time task commit actually takes place

> FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new 
> file isn't yet visible
> 
>
> Key: SPARK-21762
> URL: https://issues.apache.org/jira/browse/SPARK-21762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: object stores without complete creation consistency 
> (this includes AWS S3's caching of negative GET results)
>Reporter: Steve Loughran
>Priority: Minor
>
> The metrics collection of SPARK-20703 can trigger premature failure if the 
> newly written object isn't actually visible yet, that is if, after 
> {{writer.close()}}, a {{getFileStatus(path)}} call throws a 
> {{FileNotFoundException}}.
> Strictly speaking, not having a file immediately visible goes against the 
> fundamental expectations of the Hadoop FS APIs, namely fully consistent data & 
> metadata across all operations, with immediate global visibility of all 
> changes. However, not all object stores make that guarantee, be it for newly 
> created data or for updated blobs. And so spurious FNFEs can get raised, ones 
> which *should* have gone away by the time the actual task is committed. Or if 
> they haven't, the job is in deep trouble anyway.
> What to do?
> # Leave as is: fail fast and so catch blobstores/blobstore clients which don't 
> behave as required. One issue here: will that trigger retries, what happens 
> there, etc.?
> # Swallow the FNFE and hope the file is observable later.
> # Swallow all IOEs and hope that whatever problem the FS has is transient.
> Options 2 & 3 aren't going to collect metrics in the event of an FNFE, or at 
> least, not the counter of bytes written.






[jira] [Updated] (SPARK-21762) FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new file isn't yet visible

2017-08-17 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-21762:
---
Summary: FileFormatWriter/BasicWriteTaskStatsTracker metrics collection 
fails if a new file isn't yet visible  (was: FileFormatWriter metrics 
collection fails if a newly close()d file isn't yet visible)

> FileFormatWriter/BasicWriteTaskStatsTracker metrics collection fails if a new 
> file isn't yet visible
> 
>
> Key: SPARK-21762
> URL: https://issues.apache.org/jira/browse/SPARK-21762
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: object stores without complete creation consistency 
> (this includes AWS S3's caching of negative GET results)
>Reporter: Steve Loughran
>Priority: Minor
>
> The metrics collection of SPARK-20703 can trigger premature failure if the 
> newly written object isn't actually visible yet, that is if, after 
> {{writer.close()}}, a {{getFileStatus(path)}} call throws a 
> {{FileNotFoundException}}.
> Strictly speaking, not having a file immediately visible goes against the 
> fundamental expectations of the Hadoop FS APIs, namely fully consistent data & 
> metadata across all operations, with immediate global visibility of all 
> changes. However, not all object stores make that guarantee, be it for newly 
> created data or for updated blobs. And so spurious FNFEs can get raised, ones 
> which *should* have gone away by the time the actual task is committed. Or if 
> they haven't, the job is in deep trouble anyway.
> What to do?
> # Leave as is: fail fast and so catch blobstores/blobstore clients which don't 
> behave as required. One issue here: will that trigger retries, what happens 
> there, etc.?
> # Swallow the FNFE and hope the file is observable later.
> # Swallow all IOEs and hope that whatever problem the FS has is transient.
> Options 2 & 3 aren't going to collect metrics in the event of an FNFE, or at 
> least, not the counter of bytes written.






[jira] [Commented] (SPARK-21767) Add Decimal Test For Avro in VersionSuite

2017-08-17 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131028#comment-16131028
 ] 

Xiao Li commented on SPARK-21767:
-

https://github.com/apache/spark/pull/18977


> Add Decimal Test For Avro in VersionSuite
> -
>
> Key: SPARK-21767
> URL: https://issues.apache.org/jira/browse/SPARK-21767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Decimal is a logical type in Avro. We need to ensure that the support for 
> Hive's Avro serde works well in Spark.






[jira] [Updated] (SPARK-21767) Add Decimal Test For Avro in VersionSuite

2017-08-17 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21767:

Description: Decimal is a logical type in Avro. We need to ensure that the 
support for Hive's Avro serde works well in Spark.

> Add Decimal Test For Avro in VersionSuite
> -
>
> Key: SPARK-21767
> URL: https://issues.apache.org/jira/browse/SPARK-21767
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Decimal is a logical type in Avro. We need to ensure that the support for 
> Hive's Avro serde works well in Spark.






[jira] [Created] (SPARK-21767) Add Decimal Test For Avro in VersionSuite

2017-08-17 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21767:
---

 Summary: Add Decimal Test For Avro in VersionSuite
 Key: SPARK-21767
 URL: https://issues.apache.org/jira/browse/SPARK-21767
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Xiao Li
Assignee: Xiao Li









[jira] [Updated] (SPARK-21766) DataFrame toPandas() raises ValueError with nullable int columns

2017-08-17 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21766:
-
Summary: DataFrame toPandas() raises ValueError with nullable int columns  
(was: DataFrame toPandas() )

> DataFrame toPandas() raises ValueError with nullable int columns
> 
>
> Key: SPARK-21766
> URL: https://issues.apache.org/jira/browse/SPARK-21766
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an 
> IntegerType column that has null values, the following exception is thrown:
> {noformat}
> ValueError: Cannot convert non-finite values (NA or inf) to integer
> {noformat}
> This is because the null values first get converted to float NaN during the 
> construction of the Pandas DataFrame in {{from_records}}, and the attempt to 
> convert them back to an integer then fails.






[jira] [Created] (SPARK-21766) DataFrame toPandas()

2017-08-17 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-21766:


 Summary: DataFrame toPandas() 
 Key: SPARK-21766
 URL: https://issues.apache.org/jira/browse/SPARK-21766
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Bryan Cutler


When calling {{DataFrame.toPandas()}} (without Arrow enabled), if there is an 
IntegerType column that has null values, the following exception is thrown:

{noformat}
ValueError: Cannot convert non-finite values (NA or inf) to integer
{noformat}

This is because the null values first get converted to float NaN during the 
construction of the Pandas DataFrame in {{from_records}}, and the attempt to 
convert them back to an integer then fails.






[jira] [Comment Edited] (SPARK-15689) Data source API v2

2017-08-17 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130863#comment-16130863
 ] 

Wenchen Fan edited comment on SPARK-15689 at 8/17/17 5:25 PM:
--

google doc attached!


was (Author: cloud_fan):
good doc attached!

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.






[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-17 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130863#comment-16130863
 ] 

Wenchen Fan commented on SPARK-15689:
-

good doc attached!

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency on 
> DataFrame/SQLContext, making data source API compatibility depend on 
> the upper-level API. The current data source API is also row oriented only 
> and has to go through an expensive conversion from external data types to 
> internal data types.
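Purely as an illustration of points 2 and 3 above (the trait and method names here are made up for this sketch; the actual interfaces are what the attached SPIP proposes):

{code}
import org.apache.spark.sql.sources.Filter

// Placeholder for whatever column-batch abstraction the final API defines.
trait ColumnBatchLike {
  def numRows(): Int
}

// A deliberately small reader surface: no dependency on DataFrame/SQLContext,
// filter push-down, and a columnar read path.
trait V2ReaderSketch {
  /** Accept what can be pushed down; return the filters the source cannot handle. */
  def pushFilters(filters: Array[Filter]): Array[Filter]

  /** Read the data as column batches rather than row by row. */
  def readBatches(): Iterator[ColumnBatchLike]
}
{code}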






[jira] [Commented] (SPARK-17414) Set type is not supported for creating data frames

2017-08-17 Thread Alexander Bessonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130852#comment-16130852
 ] 

Alexander Bessonov commented on SPARK-17414:


Fixed in SPARK-21204

> Set type is not supported for creating data frames
> --
>
> Key: SPARK-17414
> URL: https://issues.apache.org/jira/browse/SPARK-17414
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Emre Colak
>Priority: Minor
>
> For a case class that has a field of type Set, the createDataFrame() method 
> throws an exception saying "Schema for type Set is not supported". The exception 
> is raised by the org.apache.spark.sql.catalyst.ScalaReflection class, where 
> Array, Seq and Map types are supported but Set is not. It would be nice to 
> support Set here by default instead of having to write a custom Encoder.
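A minimal reproducer on the affected versions, assuming a SparkSession named {{spark}} (the case class is made up):

{code}
// On 2.0.x this fails in org.apache.spark.sql.catalyst.ScalaReflection with an
// UnsupportedOperationException saying the schema for the Set type is not supported.
case class Record(id: Int, tags: Set[String])

val data = Seq(Record(1, Set("a", "b")), Record(2, Set("c")))
val df = spark.createDataFrame(data)   // throws; Array, Seq and Map fields would work
{code}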






[jira] [Created] (SPARK-21765) Mark all streaming plans as isStreaming

2017-08-17 Thread Jose Torres (JIRA)
Jose Torres created SPARK-21765:
---

 Summary: Mark all streaming plans as isStreaming
 Key: SPARK-21765
 URL: https://issues.apache.org/jira/browse/SPARK-21765
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Structured Streaming
Affects Versions: 2.2.0
Reporter: Jose Torres


LogicalPlan has an isStreaming bit, but it's incompletely implemented. Some 
streaming sources don't set the bit, and the bit can sometimes be lost in 
rewriting. Setting the bit for all plans that are logically streaming will help 
us simplify the logic around checking query plan validity.
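For reference, the user-visible side of that bit, assuming a SparkSession named {{spark}} (the rate source is just a convenient built-in streaming source here):

{code}
// A batch Dataset's logical plan is not streaming; a readStream-based one should always be.
val staticDF = spark.range(10)
println(staticDF.isStreaming)      // false

val streamingDF = spark.readStream.format("rate").load()
println(streamingDF.isStreaming)   // true, because its logical plan carries the isStreaming bit
{code}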






[jira] [Created] (SPARK-21764) Tests failures on Windows: resources not being closed and incorrect paths

2017-08-17 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21764:


 Summary: Tests failures on Windows: resources not being closed and 
incorrect paths
 Key: SPARK-21764
 URL: https://issues.apache.org/jira/browse/SPARK-21764
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


This is actually a clone of https://issues.apache.org/jira/browse/SPARK-18922, 
but I decided to open another one here, targeting 2.3.0 as the fix version.

In short, there are many test failures on Windows, mainly due to resources not 
being closed before their removal is attempted (which fails on Windows) and to 
incorrect path inputs.






[jira] [Created] (SPARK-21763) InferSchema option does not infer the correct schema (timestamp) from xlsx file.

2017-08-17 Thread ANSHUMAN (JIRA)
ANSHUMAN created SPARK-21763:


 Summary: InferSchema option does not infer the correct schema 
(timestamp) from xlsx file.
 Key: SPARK-21763
 URL: https://issues.apache.org/jira/browse/SPARK-21763
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: Environment is my personal laptop.
Reporter: ANSHUMAN
Priority: Minor


I have an xlsx file containing a date/time field (My Time) in the following 
format, with these sample records:
5/16/2017  12:19:00 AM
5/16/2017  12:56:00 AM
5/16/2017  1:17:00 PM
5/16/2017  5:26:00 PM
5/16/2017  6:26:00 PM

I am reading the xlsx file in the following manner:

{code:java}
val inputDF = spark.sqlContext.read.format("com.crealytics.spark.excel")
.option("location","file:///C:/Users/file.xlsx")
.option("useHeader","true")
.option("treatEmptyValuesAsNulls","true")
.option("inferSchema","true")
.option("addColorColumns","false")
.load()
{code}

When I try to get the schema using 
{code:java}
inputDF.printSchema()
{code}
I get *Double*.
Sometimes I even get the schema as *String*.

And when I print the data, I get the following output:
+------------------+
|           My Time|
+------------------+
|42871.014189814814|
| 42871.03973379629|
|42871.553773148145|
| 42871.72765046296|
| 42871.76887731482|
+------------------+

The above output is clearly not correct for the given input.

Moreover, if I convert the xlsx file to csv format and read it, I get the 
output correctly. Here is how I read it in csv format:

{code:java}
spark.sqlContext.read.format("csv")
  .option("header", "true")
  .option("inferSchema", true)
  .load(fileLocation)
{code}

Please look into the issue. I could not find the answer to it anywhere.
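If the doubles really are Excel day serials (days since 1899-12-30, with the time of day in the fractional part), a workaround along these lines should recover timestamps from the badly inferred column; this is only a sketch based on that assumption, not a fix for the inference itself:

{code}
import org.apache.spark.sql.functions._

// 25569 = days between the Excel epoch (1899-12-30) and the Unix epoch (1970-01-01).
// Casting seconds-since-epoch to timestamp ignores time zone subtleties.
val fixedDF = inputDF.withColumn(
  "My Time (ts)",
  ((col("My Time") - lit(25569)) * lit(86400)).cast("timestamp"))

fixedDF.show(false)
{code}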







[jira] [Assigned] (SPARK-21428) CliSessionState never be recognized because of IsolatedClientLoader

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21428:
---

Assignee: Kent Yao

> CliSessionState never be recognized because of IsolatedClientLoader
> ---
>
> Key: SPARK-21428
> URL: https://issues.apache.org/jira/browse/SPARK-21428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.1, 2.2.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> When using bin/spark-sql with the built-in Hive jars, we expect to 
> reuse the existing instance of CliSessionState.
> {quote}
> // In `SparkSQLCLIDriver`, we have already started a 
> `CliSessionState`,
> // which contains information like configurations from command line. 
> Later
> // we call `SparkSQLEnv.init()` there, which would run into this part 
> again.
> // so we should keep `conf` and reuse the existing instance of 
> `CliSessionState`.
> {quote}
> Actually this never happens, since SessionState.get() at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L138
> will always be null because of IsolatedClientLoader.
> SessionState.start is then called many times, and each call creates a new 
> `hive.exec.scratchdir`; see the following case...
> {code:java}
> spark git:(master) bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/07/16 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/07/16 23:29:04 INFO HiveMetaStore: 0: Opening raw store with implemenation 
> class:org.apache.hadoop.hive.metastore.ObjectStore
> 17/07/16 23:29:04 INFO ObjectStore: ObjectStore, initialize called
> 17/07/16 23:29:04 INFO Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 17/07/16 23:29:04 INFO Persistence: Property datanucleus.cache.level2 unknown 
> - will be ignored
> 17/07/16 23:29:05 INFO ObjectStore: Setting MetaStore object pin classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 17/07/16 23:29:06 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:06 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 17/07/16 23:29:07 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:07 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 17/07/16 23:29:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> DERBY
> 17/07/16 23:29:07 INFO ObjectStore: Initialized ObjectStore
> 17/07/16 23:29:07 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 17/07/16 23:29:07 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> 17/07/16 23:29:08 INFO HiveMetaStore: Added admin role in metastore
> 17/07/16 23:29:08 INFO HiveMetaStore: Added public role in metastore
> 17/07/16 23:29:08 INFO HiveMetaStore: No user is added in admin role, since 
> config is empty
> 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_all_databases
> 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr  
> cmd=get_all_databases
> 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_functions: db=default pat=*
> 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr  
> cmd=get_functions: db=default pat=*
> 17/07/16 23:29:08 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:08 INFO SessionState: Created local directory: 
> /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/a2c40e42-08e2-4023-8464-3432ed690184_resources
> 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: 
> /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184
> 17/07/16 23:29:08 INFO SessionState: Created local directory: 
> /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/Kent/a2c40e42-08e2-4023-8464-3432ed690184
> 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: 
> /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184/_tmp_space.db
> 1

[jira] [Resolved] (SPARK-21428) CliSessionState never be recognized because of IsolatedClientLoader

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21428.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18648
[https://github.com/apache/spark/pull/18648]

> CliSessionState never be recognized because of IsolatedClientLoader
> ---
>
> Key: SPARK-21428
> URL: https://issues.apache.org/jira/browse/SPARK-21428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.3, 2.0.2, 2.1.1, 2.2.0
>Reporter: Kent Yao
>Priority: Minor
> Fix For: 2.3.0
>
>
> When using bin/spark-sql with the built-in Hive jars, we expect to 
> reuse the existing instance of CliSessionState.
> {quote}
> // In `SparkSQLCLIDriver`, we have already started a 
> `CliSessionState`,
> // which contains information like configurations from command line. 
> Later
> // we call `SparkSQLEnv.init()` there, which would run into this part 
> again.
> // so we should keep `conf` and reuse the existing instance of 
> `CliSessionState`.
> {quote}
> Actually this never happens, since SessionState.get() at 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala#L138
> will always be null because of IsolatedClientLoader.
> SessionState.start is then called many times, and each call creates a new 
> `hive.exec.scratchdir`; see the following case...
> {code:java}
> spark git:(master) bin/spark-sql --conf spark.sql.hive.metastore.jars=builtin
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 17/07/16 23:29:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 17/07/16 23:29:04 INFO HiveMetaStore: 0: Opening raw store with implemenation 
> class:org.apache.hadoop.hive.metastore.ObjectStore
> 17/07/16 23:29:04 INFO ObjectStore: ObjectStore, initialize called
> 17/07/16 23:29:04 INFO Persistence: Property 
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 17/07/16 23:29:04 INFO Persistence: Property datanucleus.cache.level2 unknown 
> - will be ignored
> 17/07/16 23:29:05 INFO ObjectStore: Setting MetaStore object pin classes with 
> hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
> 17/07/16 23:29:06 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:06 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 17/07/16 23:29:07 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:07 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" 
> so does not have its own datastore table.
> 17/07/16 23:29:07 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is 
> DERBY
> 17/07/16 23:29:07 INFO ObjectStore: Initialized ObjectStore
> 17/07/16 23:29:07 WARN ObjectStore: Version information not found in 
> metastore. hive.metastore.schema.verification is not enabled so recording the 
> schema version 1.2.0
> 17/07/16 23:29:07 WARN ObjectStore: Failed to get database default, returning 
> NoSuchObjectException
> 17/07/16 23:29:08 INFO HiveMetaStore: Added admin role in metastore
> 17/07/16 23:29:08 INFO HiveMetaStore: Added public role in metastore
> 17/07/16 23:29:08 INFO HiveMetaStore: No user is added in admin role, since 
> config is empty
> 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_all_databases
> 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr  
> cmd=get_all_databases
> 17/07/16 23:29:08 INFO HiveMetaStore: 0: get_functions: db=default pat=*
> 17/07/16 23:29:08 INFO audit: ugi=Kentip=unknown-ip-addr  
> cmd=get_functions: db=default pat=*
> 17/07/16 23:29:08 INFO Datastore: The class 
> "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as 
> "embedded-only" so does not have its own datastore table.
> 17/07/16 23:29:08 INFO SessionState: Created local directory: 
> /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/a2c40e42-08e2-4023-8464-3432ed690184_resources
> 17/07/16 23:29:08 INFO SessionState: Created HDFS directory: 
> /tmp/hive/Kent/a2c40e42-08e2-4023-8464-3432ed690184
> 17/07/16 23:29:08 INFO SessionState: Created local directory: 
> /var/folders/k2/04p4k4ws73l6711h_mz2_tq0gn/T/Kent/a2c40e42-08e2-4023-8464-3432ed690184
> 17/07/16 23:29:08 INFO SessionState: Created HDFS directory:

[jira] [Created] (SPARK-21762) FileFormatWriter metrics collection fails if a newly close()d file isn't yet visible

2017-08-17 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-21762:
--

 Summary: FileFormatWriter metrics collection fails if a newly 
close()d file isn't yet visible
 Key: SPARK-21762
 URL: https://issues.apache.org/jira/browse/SPARK-21762
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
 Environment: object stores without complete creation consistency (this 
includes AWS S3's caching of negative GET results)
Reporter: Steve Loughran
Priority: Minor


The metrics collection of SPARK-20703 can trigger premature failure if the 
newly written object isn't actually visible yet, that is if, after 
{{writer.close()}}, a {{getFileStatus(path)}} call throws a 
{{FileNotFoundException}}.

Strictly speaking, not having a file immediately visible goes against the 
fundamental expectations of the Hadoop FS APIs, namely fully consistent data & 
metadata across all operations, with immediate global visibility of all changes. 
However, not all object stores make that guarantee, be it for newly created 
data or for updated blobs. And so spurious FNFEs can get raised, ones which 
*should* have gone away by the time the actual task is committed. Or if they 
haven't, the job is in deep trouble anyway.

What to do?
# Leave as is: fail fast and so catch blobstores/blobstore clients which don't 
behave as required. One issue here: will that trigger retries, what happens 
there, etc.?
# Swallow the FNFE and hope the file is observable later.
# Swallow all IOEs and hope that whatever problem the FS has is transient.

Options 2 & 3 aren't going to collect metrics in the event of an FNFE, or at 
least, not the counter of bytes written.
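As a rough illustration of what option 2 could look like in the stats probe (a sketch only, not the actual SPARK-20703 code), assuming a Hadoop {{FileSystem}} and a {{Path}} are at hand:

{code}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileSystem, Path}

// Option 2: tolerate a not-yet-visible file and skip the byte count for it,
// rather than failing the task on a spurious FileNotFoundException.
def probeWrittenBytes(fs: FileSystem, path: Path): Option[Long] = {
  try {
    Some(fs.getFileStatus(path).getLen)
  } catch {
    case _: FileNotFoundException => None   // metrics simply miss this file's size
  }
}
{code}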






[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ] 

Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:15 PM:
--

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that comes with the Spark distro, so there is no 
need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark
I have a fully working jupyter notebook.
Also by typing in a cell spark, a spark session is already defined and there is 
also sc defined.
SparkSession - in-memory
SparkContext
Spark UI
Version
v2.3.0-SNAPSHOT
Master
local[*]
AppName
PySparkShell

So it's not the case that you need to set up the Spark session on your own, unless 
things are set up in some other way I am not familiar with (likely).

Then I run your example but the --packages has no effect.

{code:java}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for some ways to start things. 

Now, I suspect this is the line responsible 
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
for PYSPARK_SUBMIT_ARGS being taken into consideration, but as I said, from what 
I observed the Java gateway is launched only once, when my notebook 
is started. You can easily check that by modifying the file to print something, 
and also by checking whether you already have spark defined, as in my case. I 
searched the places where this variable is used and found nothing related to 
SparkConf, unless somehow you use spark-submit (pyspark calls that, btw). I will 
try installing pyspark with pip, but I am not sure it will make any difference. 



was (Author: skonto):
[~jsnowacki] I dont think I am doing anything wrong. I followed your 
instructions. I use pyspark which comes with the spark distro no need to 
install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark
I have a fully working jupyter notebook.
Also by typing in a cell spark, a spark session is already defined and there is 
also sc defined.
SparkSession - in-memory
SparkContext
Spark UI
Version
v2.3.0-SNAPSHOT
Master
local[*]
AppName
PySparkShell

So its not the case that you need to setup spark session on your own unless 
things are setup in some other way I am not familiar to (likely).

Then I run your example but the --packages has no effect.

{code:java}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages 
org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for someways to start things. 

Now, I suspect this is the responsible line 
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what 
I observed java gateway is used once when my pythonbook
is started. You can easily check that by modifying the file to print something 
and also by checking if you have spark already defined as in my case. I 
searched the places where this variable is utilized s

[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ] 

Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:12 PM:
--

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that ships with the Spark distro, so there is no 
need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark,
I get a fully working Jupyter notebook.
Also, typing {{spark}} in a cell shows that a Spark session is already defined, 
and {{sc}} is defined as well:
SparkSession - in-memory
SparkContext
  Spark UI
  Version: v2.3.0-SNAPSHOT
  Master: local[*]
  AppName: PySparkShell

So it's not the case that you need to set up the Spark session on your own, 
unless things are set up in some other way I am not familiar with (likely).

Then I ran your example, but --packages has no effect.

{code:python}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for someways to start things. 

Now, I suspect this is the responsible line 
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
so that PYSPARK_SUBMIT_ARGS is taken into consideration but as I said from what 
I observed java gateway is used once when my pythonbook
is started. You can easily check that by modifying the file to print something 
and also by checking if you have spark already defined as in my case. I 
searched the places where this variable is utilized so nothing related to 
SparkConf unless somehow you use spark submit (pyspark calls that btw). I will 
try installing pyspark with pip but not sure if it will make any difference.




[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ] 

Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:11 PM:
--

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that ships with the Spark distro, so there is no 
need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark,
I get a fully working Jupyter notebook.
Also, typing {{spark}} in a cell shows that a Spark session is already defined, 
and {{sc}} is defined as well:
SparkSession - in-memory
SparkContext
  Spark UI
  Version: v2.3.0-SNAPSHOT
  Master: local[*]
  AppName: PySparkShell

So it's not the case that you need to set up the Spark session on your own, 
unless things are set up in some other way I am not familiar with (likely).

Then I ran your example, but --packages has no effect.

{code:python}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for some ways to start things.

Now, I suspect this is the line responsible for PYSPARK_SUBMIT_ARGS being taken 
into account:
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
but, as I said, from what I observed the Java gateway is used only once, when my 
notebook is started. You can easily check that by modifying the file to print 
something, and also by checking whether {{spark}} is already defined, as in my 
case. I searched the places where this variable is used and found nothing related 
to SparkConf, unless you somehow go through spark-submit (which pyspark calls, by 
the way).





> Config spark.jars.packages is ignored in SparkSession config
> 
>
>  

[jira] [Comment Edited] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ] 

Stavros Kontopoulos edited comment on SPARK-21752 at 8/17/17 3:02 PM:
--

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that ships with the Spark distro, so there is no 
need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark,
I get a fully working Jupyter notebook.
Also, typing {{spark}} in a cell shows that a Spark session is already defined, 
and {{sc}} is defined as well:
SparkSession - in-memory
SparkContext
  Spark UI
  Version: v2.3.0-SNAPSHOT
  Master: local[*]
  AppName: PySparkShell

So it's not the case that you need to set up the Spark session on your own, 
unless things are set up in some other way I am not familiar with (likely).

Then I ran your example, but --packages has no effect.

{code:python}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for some ways to start things.

Now, I suspect this is the line responsible for PYSPARK_SUBMIT_ARGS being taken 
into account:
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
but, as I said, from what I observed the Java gateway is used only once, when my 
notebook is started. You can easily check that by modifying the file to print 
something, and also by checking whether {{spark}} is already defined, as in my 
case.





> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130525#comment-16130525
 ] 

Stavros Kontopoulos commented on SPARK-21752:
-

[~jsnowacki] I don't think I am doing anything wrong. I followed your 
instructions. I use the pyspark that ships with the Spark distro, so there is no 
need to install it on my system.

So when I do:
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_DRIVER_PYTHON=jupyter
and then ./pyspark,
I get a fully working Jupyter notebook.
Also, typing {{spark}} in a cell shows that a Spark session is already defined, 
and {{sc}} is defined as well:
SparkSession - in-memory
SparkContext
  Spark UI
  Version: v2.3.0-SNAPSHOT
  Master: local[*]
  AppName: PySparkShell

So it's not the case that you need to set up the Spark session on your own, 
unless things are set up in some other way I am not familiar with.

Then I ran your example, but --packages has no effect.

{code:python}
import pyspark
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()
people = spark.createDataFrame([("Bilbo Baggins",  50), ("Gandalf", 1000), 
("Thorin", 195), ("Balin", 178), ("Kili", 77),
   ("Dwalin", 169), ("Oin", 167), ("Gloin", 158), ("Fili", 82), ("Bombur", 
None)], ["name", "age"])

people.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

{code}

Check here:
https://github.com/jupyter/notebook/issues/743
https://gist.github.com/ololobus/4c221a0891775eaa86b0
for some ways to start things.

Now, I suspect this is the line responsible for PYSPARK_SUBMIT_ARGS being taken 
into account:
https://github.com/apache/spark/blob/d695a528bef6291e0e1657f4f3583a8371abd7c8/python/pyspark/java_gateway.py#L54
but, as I said, from what I observed the Java gateway is used only once, when my 
notebook is started. You can easily check that by modifying the file to print 
something, and also by checking whether {{spark}} is already defined, as in my 
case.


> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes (the Mongo connector in this case, but it's the 
> same for other packages), I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line 
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python, but I've seen the behavior in other languages, though 
> I didn't check R. 
> I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as getting new 
> packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this 
> will only work with bare Python, Scala or Java, and not in {{pyspark}} or 
> {{spark-shell}}, as they create the session automatically; in this case one 
> would need to use the {{--packages}} option. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-17 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130494#comment-16130494
 ] 

Dongjoon Hyun commented on SPARK-15689:
---

Thank you for the document, too!

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility dependent on the 
> upper-level API. The current data source API is also row-oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-08-17 Thread Varene Olivier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130472#comment-16130472
 ] 

Varene Olivier commented on SPARK-21063:


Hi,
I am experiencing the same issue with Spark 2.2.0
and HiveServer2 via http

{code:scala}
// transform scala Map[String,String] to Java Properties (needed by jdbc driver)

implicit def map2Properties(map:Map[String,String]):java.util.Properties = {
(new java.util.Properties /: map) {case (props, (k,v)) => props.put(k,v); 
props}
  }

val url = 
"jdbc:hive2://remote.server:10001/myDatabase;transportMode=http;httpPath=cliservice"
// from "org.apache.hive" % "hive-jdbc" % "1.2.2"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "myRemoteUser"
val password = "myRemotePassword"
val table = "myNonEmptyTable"

val props = Map("user" -> user,"password" -> password, "driver" -> driver)

val d = spark.read.jdbc(url,table,props)

println(d.count)

{code}

returns :

{code}
0
{code}

and my table is not empty
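
For comparison, here is a minimal PySpark sketch of the same read. It assumes the 
standard driver class from hive-jdbc ({{org.apache.hive.jdbc.HiveDriver}}) is on 
the driver classpath (e.g. via --jars), and it reuses the placeholder connection 
details above:

{code:python}
from pyspark.sql import SparkSession

# The hive-jdbc jar and its dependencies must already be on the classpath;
# the URL, table and credentials are the placeholders from the snippet above.
spark = SparkSession.builder.appName("hive-jdbc-check").getOrCreate()

url = ("jdbc:hive2://remote.server:10001/myDatabase;"
       "transportMode=http;httpPath=cliservice")
props = {
    "user": "myRemoteUser",
    "password": "myRemotePassword",
    "driver": "org.apache.hive.jdbc.HiveDriver",
}

df = spark.read.jdbc(url=url, table="myNonEmptyTable", properties=props)
df.printSchema()   # check which columns the driver actually reports
print(df.count())  # compare against the count seen in beeline / Hive CLI
{code}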

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> returns empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-08-17 Thread Varene Olivier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130472#comment-16130472
 ] 

Varene Olivier edited comment on SPARK-21063 at 8/17/17 2:25 PM:
-

Hi,
I am experiencing the same issue with Spark 2.2.0
and HiveServer2 via http

{code}
// transform scala Map[String,String] to Java Properties (needed by jdbc driver)

implicit def map2Properties(map:Map[String,String]):java.util.Properties = {
(new java.util.Properties /: map) {case (props, (k,v)) => props.put(k,v); 
props}
  }

val url = 
"jdbc:hive2://remote.server:10001/myDatabase;transportMode=http;httpPath=cliservice"
// from "org.apache.hive" % "hive-jdbc" % "1.2.2"
val driver = "org.apache.hive.jdbc.HiveDriver"
val user = "myRemoteUser"
val password = "myRemotePassword"
val table = "myNonEmptyTable"

val props = Map("user" -> user,"password" -> password, "driver" -> driver)

val d = spark.read.jdbc(url,table,props)

println(d.count)

{code}

returns :

{code}
0
{code}

and my table is not empty



> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> returns empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21642:
---

Assignee: Aki Tanaka  (was: Hideaki Tanaka)

> Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
> --
>
> Key: SPARK-21642
> URL: https://issues.apache.org/jira/browse/SPARK-21642
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.2.0
>Reporter: Aki Tanaka
>Assignee: Aki Tanaka
> Fix For: 2.3.0
>
>
> In the current implementation, the IP address of the driver host is set to 
> DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using the 
> "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" 
> properties. When these properties are configured, the Spark web UI is launched 
> with SSL enabled and the HTTPS server is configured with the custom SSL 
> certificate set in these properties.
> In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when 
> it accesses the Spark web UI, because it fails to verify the SSL certificate 
> (the Common Name of the certificate does not match DRIVER_HOST_ADDRESS).
> To avoid the exception, we should use the FQDN of the driver host for 
> DRIVER_HOST_ADDRESS.
> [1]  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942
> Error message that client gets when the client accesses spark web ui:
> javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> 
> doesn't match any of the subject alternative names: []
> {code:java}
> $ spark-submit /path/to/jar
> ..
> 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://10.43.3.8:4040
> $ curl -I http://10.43.3.8:4040
> HTTP/1.1 302 Found
> Date: Fri, 04 Aug 2017 14:48:20 GMT
> Location: https://10.43.3.8:4440/
> Content-Length: 0
> Server: Jetty(9.2.z-SNAPSHOT)
> $ curl -v https://10.43.3.8:4440
> * Rebuilt URL to: https://10.43.3.8:4440/
> *   Trying 10.43.3.8...
> * TCP_NODELAY set
> * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0)
> * Initializing NSS with certpath: sql:/etc/pki/nssdb
> *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
>   CApath: none
> * Server certificate:
> * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> * start date: Jun 12 00:05:02 2017 GMT
> * expire date: Jun 12 00:05:02 2018 GMT
> * common name: *.example.com
> * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21642:
---

Assignee: Hideaki Tanaka

> Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
> --
>
> Key: SPARK-21642
> URL: https://issues.apache.org/jira/browse/SPARK-21642
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.2.0
>Reporter: Aki Tanaka
>Assignee: Hideaki Tanaka
> Fix For: 2.3.0
>
>
> In the current implementation, the IP address of the driver host is set to 
> DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using the 
> "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" 
> properties. When these properties are configured, the Spark web UI is launched 
> with SSL enabled and the HTTPS server is configured with the custom SSL 
> certificate set in these properties.
> In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when 
> it accesses the Spark web UI, because it fails to verify the SSL certificate 
> (the Common Name of the certificate does not match DRIVER_HOST_ADDRESS).
> To avoid the exception, we should use the FQDN of the driver host for 
> DRIVER_HOST_ADDRESS.
> [1]  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942
> Error message that client gets when the client accesses spark web ui:
> javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> 
> doesn't match any of the subject alternative names: []
> {code:java}
> $ spark-submit /path/to/jar
> ..
> 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://10.43.3.8:4040
> $ curl -I http://10.43.3.8:4040
> HTTP/1.1 302 Found
> Date: Fri, 04 Aug 2017 14:48:20 GMT
> Location: https://10.43.3.8:4440/
> Content-Length: 0
> Server: Jetty(9.2.z-SNAPSHOT)
> $ curl -v https://10.43.3.8:4440
> * Rebuilt URL to: https://10.43.3.8:4440/
> *   Trying 10.43.3.8...
> * TCP_NODELAY set
> * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0)
> * Initializing NSS with certpath: sql:/etc/pki/nssdb
> *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
>   CApath: none
> * Server certificate:
> * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> * start date: Jun 12 00:05:02 2017 GMT
> * expire date: Jun 12 00:05:02 2018 GMT
> * common name: *.example.com
> * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21642) Use FQDN for DRIVER_HOST_ADDRESS instead of ip address

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21642.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18846
[https://github.com/apache/spark/pull/18846]

> Use FQDN for DRIVER_HOST_ADDRESS instead of ip address
> --
>
> Key: SPARK-21642
> URL: https://issues.apache.org/jira/browse/SPARK-21642
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.2.0
>Reporter: Aki Tanaka
> Fix For: 2.3.0
>
>
> In the current implementation, the IP address of the driver host is set to 
> DRIVER_HOST_ADDRESS [1]. This becomes a problem when we enable SSL using the 
> "spark.ssl.enabled", "spark.ssl.trustStore" and "spark.ssl.keyStore" 
> properties. When these properties are configured, the Spark web UI is launched 
> with SSL enabled and the HTTPS server is configured with the custom SSL 
> certificate set in these properties.
> In this case, the client gets a javax.net.ssl.SSLPeerUnverifiedException when 
> it accesses the Spark web UI, because it fails to verify the SSL certificate 
> (the Common Name of the certificate does not match DRIVER_HOST_ADDRESS).
> To avoid the exception, we should use the FQDN of the driver host for 
> DRIVER_HOST_ADDRESS.
> [1]  
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L222
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L942
> Error message that client gets when the client accesses spark web ui:
> javax.net.ssl.SSLPeerUnverifiedException: Certificate for <10.102.138.239> 
> doesn't match any of the subject alternative names: []
> {code:java}
> $ spark-submit /path/to/jar
> ..
> 17/08/04 14:48:07 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 17/08/04 14:48:07 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
> http://10.43.3.8:4040
> $ curl -I http://10.43.3.8:4040
> HTTP/1.1 302 Found
> Date: Fri, 04 Aug 2017 14:48:20 GMT
> Location: https://10.43.3.8:4440/
> Content-Length: 0
> Server: Jetty(9.2.z-SNAPSHOT)
> $ curl -v https://10.43.3.8:4440
> * Rebuilt URL to: https://10.43.3.8:4440/
> *   Trying 10.43.3.8...
> * TCP_NODELAY set
> * Connected to 10.43.3.8 (10.43.3.8) port 4440 (#0)
> * Initializing NSS with certpath: sql:/etc/pki/nssdb
> *   CAfile: /etc/pki/tls/certs/ca-bundle.crt
>   CApath: none
> * Server certificate:
> * subject: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> * start date: Jun 12 00:05:02 2017 GMT
> * expire date: Jun 12 00:05:02 2018 GMT
> * common name: *.example.com
> * issuer: CN=*.example.com,OU=MyDept,O=MyOrg,L=Area,C=US
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21743) top-most limit should not cause memory leak

2017-08-17 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130387#comment-16130387
 ] 

Wenchen Fan commented on SPARK-21743:
-

issue resolved by https://github.com/apache/spark/pull/18955

> top-most limit should not cause memory leak
> ---
>
> Key: SPARK-21743
> URL: https://issues.apache.org/jira/browse/SPARK-21743
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15689) Data source API v2

2017-08-17 Thread Russell Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130355#comment-16130355
 ] 

Russell Spitzer commented on SPARK-15689:
-

Thanks [~cloud_fan] for posting the design doc; it was a great read and I like a 
lot of the direction this is going in. It would be helpful if we could have 
access to the doc as a Google Doc or some other editable/comment-able form, 
though, to encourage discussion.

I left some comments on the prototype, but one thing I think could be a great 
addition would be a joinInterface. I ended up writing one of these 
specifically for Cassandra and had to do a lot of plumbing to get it to fit 
into the rest of the Catalyst ecosystem, so I think this would be a great time 
to plan ahead in the Spark design. 

The join interface would look a lot like a combination of the read and write 
APIs: given a row input and a set of expressions, the relation should return 
rows that match those expressions, OR fall back to just being a read relation 
if none of the expressions can be satisfied by the join (leaving all the 
expressions to be evaluated in Spark). 
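
Purely as an illustration of the shape such an interface might take, here is a 
language-agnostic sketch rendered in Python; every name below is hypothetical and 
none of it exists in Spark:

{code:python}
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple

# Hypothetical stand-ins; a real version would use the engine's own row and
# expression types instead of these placeholders.
Row = Tuple
Expression = object


class JoinableRelation(ABC):
    """A relation that can answer a join against pushed-down expressions."""

    @abstractmethod
    def satisfiable(self, expressions: List[Expression]) -> List[Expression]:
        """Return the subset of expressions this source can evaluate itself."""

    @abstractmethod
    def join(self, left_rows: Iterator[Row],
             expressions: List[Expression]) -> Iterator[Row]:
        """Return source rows matching each incoming row under the expressions.

        If satisfiable() returns an empty list, the engine falls back to a
        plain read relation and evaluates every expression itself.
        """
{code}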

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility dependent on the 
> upper-level API. The current data source API is also row-oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15689) Data source API v2

2017-08-17 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15689:

Attachment: SPIP Data Source API V2.pdf

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>  Labels: releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers from. The current data source API has a wide surface with a dependency 
> on DataFrame/SQLContext, making data source API compatibility dependent on the 
> upper-level API. The current data source API is also row-oriented only and has 
> to go through an expensive conversion from external data types to internal 
> data types.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21761) [Core] Add the application's final state for SparkListenerApplicationEnd event

2017-08-17 Thread lishuming (JIRA)
lishuming created SPARK-21761:
-

 Summary: [Core] Add the application's final state for 
SparkListenerApplicationEnd event
 Key: SPARK-21761
 URL: https://issues.apache.org/jira/browse/SPARK-21761
 Project: Spark
  Issue Type: Wish
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: lishuming
Priority: Minor



When adding an extra `SparkListener`, we want to get the application's final 
state, which I think is necessary to record. Maybe we can change 
`SparkListenerApplicationEnd` as below:

{code:java}
case class SparkListenerApplicationEnd(time: Long, sparkUser: String) extends 
SparkListenerEvent

import org.apache.spark.launcher.SparkAppHandle.State
case class SparkListenerApplicationEnd(time: Long, sparkUser: String, status: 
State) extends SparkListenerEvent
{code}

Of course, we would need to add implementations of this change for the different 
deploy modes. Can someone give me some advice?




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21758) `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*

2017-08-17 Thread Feng Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130288#comment-16130288
 ] 

Feng Zhu commented on SPARK-21758:
--

I can't reproduce this issue on 2.1 or on the master branch. Could you provide 
more details?

> `SHOW TBLPROPERTIES` can not get properties start with spark.sql.*
> --
>
> Key: SPARK-21758
> URL: https://issues.apache.org/jira/browse/SPARK-21758
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: StanZhai
>Priority: Critical
>
> SQL: SHOW TBLPROPERTIES test_tb("spark.sql.sources.schema.numParts")
> Exception: Table test_db.test.tb does not have property: 
> spark.sql.sources.schema.numParts
> The `spark.sql.sources.schema.numParts` property definitely exists in the 
> Hive metastore.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130282#comment-16130282
 ] 

Jakub Nowacki commented on SPARK-21752:
---

[~skonto] Well, I'm not sure where you're failing here. If you want to get 
PySpark installed with a vanilla Python distribution you can do {{pip install 
pyspark}} or {{conda install -c conda-forge pyspark}}. Other than that, the 
above scripts are complete, bar the {{import pyspark}}, as I mentioned before. 
Below I give a slightly more complete example with the environment variable:
{code}
import os
import pyspark

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

and with the {{SparkConf}} approach:
{code}
import pyspark

conf = pyspark.SparkConf()
conf.set("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config(conf=conf)\
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

and with the plain {{SparkSession}} config:

{code}
import pyspark

spark = pyspark.sql.SparkSession.builder\
.appName('test-mongo')\
.master('local[*]')\
.config("spark.jars.packages", 
"org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
.config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
.config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
.getOrCreate()

l = [("Bilbo Baggins",  50),
 ("Gandalf", 1000),
 ("Thorin", 195),
 ("Balin", 178),
 ("Kili", 77),
 ("Dwalin", 169),
 ("Oin", 167),
 ("Gloin", 158),
 ("Fili", 82),
 ("Bombur", None)]

people = spark.createDataFrame(l, ["name", "age"])

people.write \
.format("com.mongodb.spark.sql.DefaultSource") \
.mode("append") \
.save()

spark.read \
.format("com.mongodb.spark.sql.DefaultSource") \
.load() \
.show()
{code}

In my case the first two work as expected, and the last one fails with 
{{ClassNotFoundException}}.

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes (the Mongo connector in this case, but it's the 
> same for other packages), I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line 
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-

[jira] [Commented] (SPARK-21752) Config spark.jars.packages is ignored in SparkSession config

2017-08-17 Thread Jakub Nowacki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130274#comment-16130274
 ] 

Jakub Nowacki commented on SPARK-21752:
---

OK, I get the point. I think we should only consider this in an interactive, 
notebook-based environment. I certainly don't set the master when running via 
{{spark-submit}}, and using packages internally there should be discouraged as 
well.

I think the documentation should be clearer about what can and what cannot be 
used. Also, interactive environments like Jupyter should either be treated as an 
exception, or a clearer description of the required setup should be provided.

Also, especially when setting packages as above, there is no warning that the 
option is simply ignored; maybe one should be added, similar to the warning about 
reusing an existing SparkSession, i.e. 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala#L896
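
As a user-side stop-gap (rather than an actual Spark change), one rough heuristic 
is to check whether the requested package was in fact resolved onto spark.jars 
once the session is up; a sketch, reusing the connector coordinate from this 
ticket:

{code:python}
import warnings
import pyspark

requested = "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0"

spark = pyspark.sql.SparkSession.builder \
    .appName('test-mongo') \
    .config("spark.jars.packages", requested) \
    .getOrCreate()

# When --packages / spark.jars.packages is honoured, spark-submit resolves the
# coordinate through Ivy and appends the downloaded jars to spark.jars.
jars = spark.sparkContext.getConf().get("spark.jars", "")
if "mongo-spark-connector" not in jars:
    warnings.warn("mongo-spark-connector is missing from spark.jars; "
                  "spark.jars.packages was probably ignored, so set it via "
                  "PYSPARK_SUBMIT_ARGS or spark-defaults.conf before the JVM starts")
{code}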

> Config spark.jars.packages is ignored in SparkSession config
> 
>
> Key: SPARK-21752
> URL: https://issues.apache.org/jira/browse/SPARK-21752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jakub Nowacki
>
> If I put a config key {{spark.jars.packages}} using {{SparkSession}} builder 
> as follows:
> {code}
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")\
> .config("spark.mongodb.input.uri", "mongodb://mongo/test.coll") \
> .config("spark.mongodb.output.uri", "mongodb://mongo/test.coll") \
> .getOrCreate()
> {code}
> the SparkSession gets created but there are no package download logs printed, 
> and if I use the loaded classes (the Mongo connector in this case, but it's the 
> same for other packages), I get {{java.lang.ClassNotFoundException}} for the 
> missing classes.
> If I use the config file {{conf/spark-defaults.conf}} or the command line 
> option {{--packages}}, e.g.:
> {code}
> import os
> os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0 pyspark-shell'
> {code}
> it works fine. Interestingly, using {{SparkConf}} object works fine as well, 
> e.g.:
> {code}
> conf = pyspark.SparkConf()
> conf.set("spark.jars.packages", 
> "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
> conf.set("spark.mongodb.input.uri", "mongodb://mongo/test.coll")
> conf.set("spark.mongodb.output.uri", "mongodb://mongo/test.coll")
> spark = pyspark.sql.SparkSession.builder\
> .appName('test-mongo')\
> .master('local[*]')\
> .config(conf=conf)\
> .getOrCreate()
> {code}
> The above is in Python, but I've seen the behavior in other languages, though 
> I didn't check R. 
> I have also seen it in older Spark versions.
> It seems that this is the only config key that doesn't work for me via the 
> {{SparkSession}} builder config.
> Note that this is related to creating a new {{SparkSession}}, as getting new 
> packages into an existing {{SparkSession}} indeed doesn't make sense. Thus this 
> will only work with bare Python, Scala or Java, and not in {{pyspark}} or 
> {{spark-shell}}, as they create the session automatically; in this case one 
> would need to use the {{--packages}} option. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


