[jira] [Resolved] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21238.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18450
[https://github.com/apache/spark/pull/18450]

> allow nested SQL execution
> --
>
> Key: SPARK-21238
> URL: https://issues.apache.org/jira/browse/SPARK-21238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-3577) Add task metric to report spill time

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-3577:
--

Assignee: Sital Kedia

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Assignee: Sital Kedia
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).






[jira] [Resolved] (SPARK-3577) Add task metric to report spill time

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-3577.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 17471
[https://github.com/apache/spark/pull/17471]

> Add task metric to report spill time
> 
>
> Key: SPARK-3577
> URL: https://issues.apache.org/jira/browse/SPARK-3577
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Kay Ousterhout
>Priority: Minor
> Fix For: 2.3.0
>
> Attachments: spill_size.jpg
>
>
> The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into 
> {{ExternalSorter}}.  The write time recorded in those metrics is never used.  
> We should probably add task metrics to report this spill time, since for 
> shuffles, this would have previously been reported as part of shuffle write 
> time (with the original hash-based sorter).






[jira] [Created] (SPARK-21249) Is it possible to use File Sink with mapGroupsWithState in Structured Streaming?

2017-06-28 Thread Amit Baghel (JIRA)
Amit Baghel created SPARK-21249:
---

 Summary: Is it possible to use File Sink with mapGroupsWithState 
in Structured Streaming?
 Key: SPARK-21249
 URL: https://issues.apache.org/jira/browse/SPARK-21249
 Project: Spark
  Issue Type: Question
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Amit Baghel
Priority: Minor


I am working with 2.2.0-SNAPSHOT and Structured Streaming. Is it possible to 
use the File Sink with mapGroupsWithState? With Append output mode I get the 
exception below.

Exception in thread "main" org.apache.spark.sql.AnalysisException: 
mapGroupsWithState is not supported with Append output mode on a streaming 
DataFrame/Dataset;;
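
For reference, a minimal Scala sketch of the pattern that triggers this check (the socket source, paths, and state type here are illustrative assumptions, not taken from the report). mapGroupsWithState only runs in Update mode, while the file sink only supports Append, so the query below fails at start() with the same AnalysisException.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

val spark = SparkSession.builder.appName("map-groups-file-sink").getOrCreate()
import spark.implicits._

// Illustrative streaming source; any streaming Dataset[String] would do.
val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load().as[String]

// Running count per key, kept in arbitrary state.
val counts = events
  .groupByKey(identity)
  .mapGroupsWithState[Long, (String, Long)](GroupStateTimeout.NoTimeout) {
    (key: String, values: Iterator[String], state: GroupState[Long]) =>
      val total = state.getOption.getOrElse(0L) + values.size
      state.update(total)
      (key, total)
  }

// The file sink only supports Append mode, but mapGroupsWithState does not,
// so the query is rejected when it is started.
val query = counts.writeStream
  .format("parquet")
  .option("path", "/tmp/out")                 // illustrative paths
  .option("checkpointLocation", "/tmp/ckpt")
  .outputMode(OutputMode.Append())
  .start()
{code}

In 2.2, flatMapGroupsWithState(OutputMode.Append, ...) is the variant intended for Append-mode queries and file sinks.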






[jira] [Commented] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067679#comment-16067679
 ] 

Apache Spark commented on SPARK-21093:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18463

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally 
> fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /u

[jira] [Updated] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zenglinxi updated SPARK-21223:
--
Attachment: historyserver_jstack.txt

BTW, this causes an infinite loop when we restart the history server and 
replay the event logs of Spark apps.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
> Attachments: historyserver_jstack.txt
>
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in class 
> FsHistoryProvider to map each event log path to its attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}
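
A minimal sketch of one possible mitigation (not the actual Spark patch; the value type is a hypothetical stand-in): publish replay results into a concurrent map so the tasks submitted to replayExecutor cannot lose or corrupt entries.

{code:java}
import java.util.concurrent.ConcurrentHashMap
import org.apache.hadoop.fs.Path

// Hypothetical stand-in for the per-attempt info kept by FsHistoryProvider.
case class AppAttemptInfo(appId: String, attemptId: Option[String])

// ConcurrentHashMap gives atomic put/get, so concurrent replay tasks cannot
// corrupt the map the way a plain mutable HashMap can.
val fileToAppInfo = new ConcurrentHashMap[Path, AppAttemptInfo]()

def mergeApplicationListing(logPath: Path): Unit = {
  // ... parse the event log for this path ...
  fileToAppInfo.put(logPath, AppAttemptInfo("app-123", None)) // thread-safe update
}
{code}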






[jira] [Resolved] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21237.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18449
[https://github.com/apache/spark/pull/18449]

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21237:
---

Assignee: Zhenhua Wang

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>







[jira] [Resolved] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21229.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18440
[https://github.com/apache/spark/pull/18440]

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 2.3.0
>
>







[jira] [Issue Comment Deleted] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21208:
-
Comment: was deleted

(was: User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18431)

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> sparkR.
> Was not able to find a way to *setLocalProperty* on sc.
> Need ability to *setLocalProperty* on sparkContext(similar to available for 
> pyspark, scala)
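
For context, this is the Scala-side API the JIRA asks to mirror in SparkR; a minimal sketch (the scheduler pool name is just an example):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("local-property-demo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Local properties are attached to every job submitted from this thread,
// e.g. to route them to a fair-scheduler pool.
sc.setLocalProperty("spark.scheduler.pool", "reporting")
spark.range(1000L).count()

// Setting the value to null clears the property for this thread.
sc.setLocalProperty("spark.scheduler.pool", null)
{code}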






[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21093:


Assignee: Apache Spark  (was: Hyukjin Kwon)

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally 
> fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]

[jira] [Assigned] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21093:


Assignee: Hyukjin Kwon  (was: Apache Spark)

> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally 
> fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]

[jira] [Comment Edited] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067637#comment-16067637
 ] 

Felix Cheung edited comment on SPARK-21093 at 6/29/17 3:12 AM:
---

this was reverted.

we seem to be getting random test terminations with error code -10 after this 
was merged.


was (Author: felixcheung):
this was reverted.


> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally 
> fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/li

[jira] [Reopened] (SPARK-21093) Multiple gapply execution occasionally failed in SparkR

2017-06-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reopened SPARK-21093:
--

this was reverted.


> Multiple gapply execution occasionally failed in SparkR 
> 
>
> Key: SPARK-21093
> URL: https://issues.apache.org/jira/browse/SPARK-21093
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1, 2.2.0
> Environment: CentOS 7.2.1511 / R 3.4.0, CentOS 7.2.1511 / R 3.3.3
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Critical
> Fix For: 2.3.0
>
>
> On CentOS 7.2.1511 with R 3.4.0/3.3.0, multiple executions of {{gapply}} occasionally 
> fail as below:
> {code}
>  Welcome to
>   __
>/ __/__  ___ _/ /__
>   _\ \/ _ \/ _ `/ __/  '_/
>  /___/ .__/\_,_/_/ /_/\_\   version  2.3.0-SNAPSHOT
> /_/
>  SparkSession available as 'spark'.
> > df <- createDataFrame(list(list(1L, 1, "1", 0.1)), c("a", "b", "c", "d"))
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> 17/06/14 18:21:01 WARN Utils: Truncated the string representation of a plan 
> since it was too large. This behavior can be adjusted by setting 
> 'spark.debug.maxToStringFields' in SparkEnv.conf.
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
>   a b c   d
> 1 1 1 1 0.1
> > collect(gapply(df, "a", function(key, x) { x }, schema(df)))
> Error in handleErrors(returnStatus, conn) :
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 98 
> in stage 14.0 failed 1 times, most recent failure: Lost task 98.0 in stage 
> 14.0 (TID 1305, localhost, executor driver): org.apache.spark.SparkException: 
> R computation failed with
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:432)
> at 
> org.apache.spark.sql.execution.FlatMapGroupsInRExec$$anonfun$13.apply(objects.scala:414)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.a
> ...
> *** buffer overflow detected ***: /usr/lib64/R/bin/exec/R terminated
> === Backtrace: =
> /lib64/libc.so.6(__fortify_fail+0x37)[0x7fe699b3f597]
> /lib64/libc.so.6(+0x10c750)[0x7fe699b3d750]
> /lib64/libc.so.6(+0x10e507)[0x7fe699b3f507]
> /usr/lib64/R/modules//internet.so(+0x6015)[0x7fe689bb7015]
> /usr/lib64/R/modules//internet.so(+0xe81e)[0x7fe689bbf81e]
> /usr/lib64/R/lib/libR.so(+0xbd1b6)[0x7fe69c54a1b6]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x354)[0x7fe69c5ad2f4]
> /usr/lib64/R/lib/libR.so(+0x123f8e)[0x7fe69c5b0f8e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x589)[0x7fe69c5ad529]
> /usr/lib64/R/lib/libR.so(+0x1254ce)[0x7fe69c5b24ce]
> /usr/lib64/R/lib/libR.so(+0x1104d0)[0x7fe69c59d4d0]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x817)[0x7fe69c5ad7b7]
> /usr/lib64/R/lib/libR.so(+0x1256d1)[0x7fe69c5b26d1]
> /usr/lib64/R/lib/libR.so(+0x1552e9)[0x7fe69c5e22e9]
> /usr/lib64/R/lib/libR.so(+0x11062a)[0x7fe69c59d62a]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x1221af)[0x7fe69c5af1af]
> /usr/lib64/R/lib/libR.so(+0x119101)[0x7fe69c5a6101]
> /usr/lib64/R/lib/libR.so(Rf_eval+0x198)[0x7fe69c5ad138]
> /usr/lib64/R/lib/libR.so(+0x120a7e)[0x7fe69c5ada7e]
> /usr/lib64/R/lib/libR.so(+0x11

[jira] [Commented] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT

2017-06-28 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067620#comment-16067620
 ] 

Yuming Wang commented on SPARK-21246:
-

{{Seq(3)}} should be {{Seq(3L)}}. This works for me:
{code:java}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schemaString = "name"
val lstVals = Seq(3L)
val rowRdd = sc.parallelize(lstVals).map(x => Row( x ))
rowRdd.collect()
// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
  .map(fieldName => StructField(fieldName, LongType, nullable = true))
val schema = StructType(fields)
print(schema)
val peopleDF = spark.createDataFrame(rowRdd, schema)
peopleDF.show()
{code}
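
An alternative sketch with the same intent, reusing the rest of the snippet above: keep {{Seq(3)}} but widen the value to Long when building each Row, since a LongType column must be backed by a Long value at runtime.

{code:java}
// Row fields are not auto-widened from Int to Long to match a LongType column.
val rowRdd = sc.parallelize(Seq(3)).map(x => Row(x.toLong))
{code}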


> Unexpected Data Type conversion from LONG to BIGINT
> ---
>
> Key: SPARK-21246
> URL: https://issues.apache.org/jira/browse/SPARK-21246
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
> Environment: Using Zeppelin Notebook or Spark Shell
>Reporter: Monica Raj
>
> The unexpected conversion occurred when creating a data frame out of an 
> existing data collection. The following code can be run in a Zeppelin notebook 
> to reproduce the bug:
> {code:java}
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.Row
> val schemaString = "name"
> val lstVals = Seq(3)
> val rowRdd = sc.parallelize(lstVals).map(x => Row(x))
> rowRdd.collect()
> // Generate the schema based on the schema string
> val fields = schemaString.split(" ")
>   .map(fieldName => StructField(fieldName, LongType, nullable = true))
> val schema = StructType(fields)
> print(schema)
> val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
> {code}






[jira] [Assigned] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-28 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung reassigned SPARK-21224:


Assignee: Hyukjin Kwon

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up for SPARK-20431, but I decided to make 
> this separate for R specifically, as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes the support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}






[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-28 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067608#comment-16067608
 ] 

Felix Cheung commented on SPARK-21224:
--

let's add this to from_json, gapply, and dapply too, as discussed? Feel free to 
open a new JIRA if you think that's better.

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up for SPARK-20431, but I decided to make 
> this separate for R specifically, as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes the support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}






[jira] [Updated] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-28 Thread Jiang Xingbo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiang Xingbo updated SPARK-21225:
-
Issue Type: Bug  (was: Improvement)

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> But this code only considers the situation of one task per core. 
> If the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so 
> much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}
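
As a quick sanity check of the proposed sizing (illustrative numbers, not from the report): with 8 cores per offer and spark.task.cpus = 2, at most 4 tasks can be scheduled on that offer, so the buffer only needs 4 slots instead of 8.

{code:java}
// Illustrative capacity calculation, assuming o.cores = 8 and CPUS_PER_TASK = 2.
val cores = 8
val cpusPerTask = 2
val oldCapacity = cores                                        // 8 pre-allocated slots
val newCapacity = math.ceil(cores * 1.0 / cpusPerTask).toInt   // 4 slots, one per schedulable task
{code}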






[jira] [Resolved] (SPARK-14657) RFormula output wrong features when formula w/o intercept

2017-06-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14657.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> RFormula output wrong features when formula w/o intercept
> -
>
> Key: SPARK-14657
> URL: https://issues.apache.org/jira/browse/SPARK-14657
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.3.0
>
>
> SparkR::glm outputs different features compared with R's glm when fitting w/o an 
> intercept on string/category features. In the following example, 
> SparkR outputs three features while native R outputs four.
> SparkR::glm
> {quote}
> training <- suppressWarnings(createDataFrame(sqlContext, iris))
> model <- glm(Sepal_Width ~ Sepal_Length + Species - 1, data = training)
> summary(model)
> Coefficients:
> Estimate  Std. Error  t value  Pr(>|t|)
> Sepal_Length0.67468   0.0093013   72.536   0
> Species_versicolor  -1.2349   0.07269 -16.989  0
> Species_virginica   -1.4708   0.077397-19.003  0
> {quote}
> stats::glm
> {quote}
> summary(glm(Sepal.Width ~ Sepal.Length + Species - 1, data = iris))
> Coefficients:
>   Estimate Std. Error t value Pr(>|t|)
> Sepal.Length0.3499 0.0463   7.557 4.19e-12 ***
> Speciessetosa   1.6765 0.2354   7.123 4.46e-11 ***
> Speciesversicolor   0.6931 0.2779   2.494   0.0137 *  
> Speciesvirginica0.6690 0.3078   2.174   0.0313 *  
> {quote}
> The encoding of string/category features is different: R does not drop any 
> category, but SparkR drops the last one.
> I searched online and tested some other cases, and found that when we fit an R glm 
> model (or other models powered by an R formula) w/o an intercept on a dataset including 
> string/category features, one of the categories of the first category feature 
> is used as the reference category, and no category is dropped for that 
> feature.
> I think we should keep consistent semantics between Spark RFormula and the R 
> formula.
> cc [~mengxr] 
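
For anyone reproducing this on the Scala side, the same encoding is driven by the ML RFormula transformer; a minimal sketch (column names follow the SparkR example above, and {{training}} is assumed to be the iris DataFrame):

{code:java}
import org.apache.spark.ml.feature.RFormula

val formula = new RFormula()
  .setFormula("Sepal_Width ~ Sepal_Length + Species - 1")
  .setFeaturesCol("features")
  .setLabelCol("label")

// Inspect how the category feature is encoded into the feature vector.
val encoded = formula.fit(training).transform(training)
encoded.select("features", "label").show(5, truncate = false)
{code}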






[jira] [Resolved] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21222.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18429
[https://github.com/apache/spark/pull/18429]

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Move the elimination of the Distinct clause from the analyzer to the optimizer.
> A Distinct clause is redundant inside a MAX/MIN aggregate. For example,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this optimization is implemented in the analyzer. It should be in the 
> optimizer.
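
A minimal sketch of what such an optimizer rule could look like (an illustration of the technique, not the merged implementation): rewrite distinct MAX/MIN aggregate expressions to their non-distinct form.

{code:java}
import org.apache.spark.sql.catalyst.expressions.aggregate.{AggregateExpression, Max, Min}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// DISTINCT does not change the result of MAX/MIN, so drop the flag and let the
// cheaper non-distinct aggregation path run.
object EliminateDistinctInMaxMin extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case ae @ AggregateExpression(_: Max | _: Min, _, true, _) =>
      ae.copy(isDistinct = false)
  }
}
{code}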






[jira] [Assigned] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-28 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21222:
---

Assignee: Gengliang Wang

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Move the elimination of the Distinct clause from the analyzer to the optimizer.
> A Distinct clause is redundant inside a MAX/MIN aggregate. For example,
> "SELECT MAX(DISTINCT a) FROM src"
> is equivalent to
> "SELECT MAX(a) FROM src"
> However, this optimization is implemented in the analyzer. It should be in the 
> optimizer.






[jira] [Commented] (SPARK-18441) Add Smote in spark mlib and ml

2017-06-28 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067494#comment-16067494
 ] 

yuhao yang commented on SPARK-18441:


Moved the SMOTE code to 
https://gist.github.com/hhbyyh/346467373014943a7f20df208caeb19b

> Add Smote in spark mlib and ml
> --
>
> Key: SPARK-18441
> URL: https://issues.apache.org/jira/browse/SPARK-18441
> Project: Spark
>  Issue Type: Wish
>  Components: ML, MLlib
>Affects Versions: 2.0.1
>Reporter: lichenglin
>
> Please add SMOTE to Spark MLlib and ML to handle imbalanced training data for 
> classification.






[jira] [Commented] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067425#comment-16067425
 ] 

Apache Spark commented on SPARK-21248:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/18461

> Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets 
> (failOnDataLoss: true)
> ---
>
> Key: SPARK-21248
> URL: https://issues.apache.org/jira/browse/SPARK-21248
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>
> {code}
> org.scalatest.exceptions.TestFailedException:  Stream Thread Died: null 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)  
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>== Progress ==AssertOnQuery(, )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]StopStream
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@6c63901,Map())
> CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]
> AddKafkaData(topics = Set(topic-7), data = WrappedArray(30, 31, 32, 33, 34), 
> message = )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22],[30],[31],[32],[33],[34]
> StopStream  == Stream == Output Mode: Append Stream state: not started Thread 
> state: dead  java.lang.InterruptedException  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) 
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  at 
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)
>== Sink == 0: [-20] [-21] [-22] [22] [11] [12] [0] [1] [2] 1: [30] 2: [33] 
> [31] [32] [34]   == Plan ==   
> {code}
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/






[jira] [Assigned] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21248:


Assignee: Apache Spark

> Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets 
> (failOnDataLoss: true)
> ---
>
> Key: SPARK-21248
> URL: https://issues.apache.org/jira/browse/SPARK-21248
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> {code}
> org.scalatest.exceptions.TestFailedException:  Stream Thread Died: null 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)  
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>== Progress ==AssertOnQuery(, )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]StopStream
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@6c63901,Map())
> CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]
> AddKafkaData(topics = Set(topic-7), data = WrappedArray(30, 31, 32, 33, 34), 
> message = )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22],[30],[31],[32],[33],[34]
> StopStream  == Stream == Output Mode: Append Stream state: not started Thread 
> state: dead  java.lang.InterruptedException  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) 
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  at 
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)
>== Sink == 0: [-20] [-21] [-22] [22] [11] [12] [0] [1] [2] 1: [30] 2: [33] 
> [31] [32] [34]   == Plan ==   
> {code}
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/






[jira] [Assigned] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21248:


Assignee: (was: Apache Spark)

> Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets 
> (failOnDataLoss: true)
> ---
>
> Key: SPARK-21248
> URL: https://issues.apache.org/jira/browse/SPARK-21248
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Shixiong Zhu
>
> {code}
> org.scalatest.exceptions.TestFailedException:  Stream Thread Died: null 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)  
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>== Progress ==AssertOnQuery(, )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]StopStream
> StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@6c63901,Map())
> CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]
> AddKafkaData(topics = Set(topic-7), data = WrappedArray(30, 31, 32, 33, 34), 
> message = )CheckAnswer: 
> [-20],[-21],[-22],[0],[1],[2],[11],[12],[22],[30],[31],[32],[33],[34]
> StopStream  == Stream == Output Mode: Append Stream state: not started Thread 
> state: dead  java.lang.InterruptedException  at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
>   at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) 
>  at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  at 
> org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  at 
> org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:375)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)
>== Sink == 0: [-20] [-21] [-22] [22] [11] [12] [0] [1] [2] 1: [30] 2: [33] 
> [31] [32] [34]   == Plan ==   
> {code}
> See 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21248) Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign from specific offsets (failOnDataLoss: true)

2017-06-28 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-21248:


 Summary: Flaky test: o.a.s.sql.kafka010.KafkaSourceSuite.assign 
from specific offsets (failOnDataLoss: true)
 Key: SPARK-21248
 URL: https://issues.apache.org/jira/browse/SPARK-21248
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Shixiong Zhu


{code}
org.scalatest.exceptions.TestFailedException:  Stream Thread Died: null 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
  scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)  
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  
org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  
org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
  
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
   == Progress ==AssertOnQuery(, )CheckAnswer: 
[-20],[-21],[-22],[0],[1],[2],[11],[12],[22]StopStream
StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@6c63901,Map())  
  CheckAnswer: [-20],[-21],[-22],[0],[1],[2],[11],[12],[22]
AddKafkaData(topics = Set(topic-7), data = WrappedArray(30, 31, 32, 33, 34), 
message = )CheckAnswer: 
[-20],[-21],[-22],[0],[1],[2],[11],[12],[22],[30],[31],[32],[33],[34]
StopStream  == Stream == Output Mode: Append Stream state: not started Thread 
state: dead  java.lang.InterruptedException  at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
  at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)  
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)  at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)  at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)  at 
org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)  at 
org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)  at 
org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)  at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.deactivateInstances(StateStoreCoordinator.scala:108)
  at 
org.apache.spark.sql.streaming.StreamingQueryManager.notifyQueryTermination(StreamingQueryManager.scala:335)
  at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:375)
  at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211)
   == Sink == 0: [-20] [-21] [-22] [22] [11] [12] [0] [1] [2] 1: [30] 2: [33] 
[31] [32] [34]   == Plan ==   
{code}


See 
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.6/3173/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/assign_from_specific_offsets__failOnDataLoss__true_/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21247) Allow case-insensitive type equality in Set operation

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21247:


Assignee: (was: Apache Spark)

> Allow case-insensitive type equality in Set operation
> -
>
> Key: SPARK-21247
> URL: https://issues.apache.org/jira/browse/SPARK-21247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>
> Spark supports case-sensitivity in columns. Especially, for Struct types, 
> with case sensitive option, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1, 'A', 2).a").show
> +--+
> |named_struct(a, 1, A, 2).a|
> +--+
> | 1|
> +--+
> scala> sql("select named_struct('a', 1, 'A', 2).A").show
> +--+
> |named_struct(a, 1, A, 2).A|
> +--+
> | 2|
> +--+
> {code}
> And vice versa, with case sensitive `false`, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
> +++
> |named_struct(a, 1).A|named_struct(A, 1).a|
> +++
> |   1|   1|
> +++
> {code}
> This issue aims to support case-insensitive type comparisons in Set 
> operation. Currently, SET operations fail due to a case-sensitive type 
> comparison failure.
> {code}
> scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. struct<A:int> <> struct<a:int> at the first 
> column of the second table;;
> 'Union
> :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2]
> :  +- OneRowRelation$
> +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3]
>+- OneRowRelation$
> {code}
> Please note that this issue does not aim to change all type comparison 
> semantics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21247) Allow case-insensitive type equality in Set operation

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067379#comment-16067379
 ] 

Apache Spark commented on SPARK-21247:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/18460

> Allow case-insensitive type equality in Set operation
> -
>
> Key: SPARK-21247
> URL: https://issues.apache.org/jira/browse/SPARK-21247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>
> Spark supports case-sensitivity in columns. Especially, for Struct types, 
> with case sensitive option, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1, 'A', 2).a").show
> +--+
> |named_struct(a, 1, A, 2).a|
> +--+
> | 1|
> +--+
> scala> sql("select named_struct('a', 1, 'A', 2).A").show
> +--+
> |named_struct(a, 1, A, 2).A|
> +--+
> | 2|
> +--+
> {code}
> And vice versa, with case sensitive `false`, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
> +++
> |named_struct(a, 1).A|named_struct(A, 1).a|
> +++
> |   1|   1|
> +++
> {code}
> This issue aims to support case-insensitive type comparisons in Set 
> operation. Currently, SET operations fail due to a case-sensitive type 
> comparison failure.
> {code}
> scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. struct<A:int> <> struct<a:int> at the first 
> column of the second table;;
> 'Union
> :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2]
> :  +- OneRowRelation$
> +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3]
>+- OneRowRelation$
> {code}
> Please note that this issue does not aim to change all type comparison 
> semantics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21247) Allow case-insensitive type equality in Set operation

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21247:


Assignee: Apache Spark

> Allow case-insensitive type equality in Set operation
> -
>
> Key: SPARK-21247
> URL: https://issues.apache.org/jira/browse/SPARK-21247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Spark supports case-sensitivity in columns. Especially, for Struct types, 
> with case sensitive option, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1, 'A', 2).a").show
> +--+
> |named_struct(a, 1, A, 2).a|
> +--+
> | 1|
> +--+
> scala> sql("select named_struct('a', 1, 'A', 2).A").show
> +--+
> |named_struct(a, 1, A, 2).A|
> +--+
> | 2|
> +--+
> {code}
> And vice versa, with case sensitive `false`, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
> +++
> |named_struct(a, 1).A|named_struct(A, 1).a|
> +++
> |   1|   1|
> +++
> {code}
> This issue aims to support case-insensitive type comparisons in Set 
> operation. Currently, SET operations fail due to a case-sensitive type 
> comparison failure.
> {code}
> scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. struct<A:int> <> struct<a:int> at the first 
> column of the second table;;
> 'Union
> :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2]
> :  +- OneRowRelation$
> +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3]
>+- OneRowRelation$
> {code}
> Please note that this issue does not aim to change all type comparison 
> semantics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21247) Allow case-insensitive type equality in Set operation

2017-06-28 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21247:
--
Summary: Allow case-insensitive type equality in Set operation  (was: Allow 
case-insensitive type comparisions in Set operation)

> Allow case-insensitive type equality in Set operation
> -
>
> Key: SPARK-21247
> URL: https://issues.apache.org/jira/browse/SPARK-21247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Dongjoon Hyun
>
> Spark supports case-sensitivity in columns. Especially, for Struct types, 
> with case sensitive option, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1, 'A', 2).a").show
> +--+
> |named_struct(a, 1, A, 2).a|
> +--+
> | 1|
> +--+
> scala> sql("select named_struct('a', 1, 'A', 2).A").show
> +--+
> |named_struct(a, 1, A, 2).A|
> +--+
> | 2|
> +--+
> {code}
> And vice versa, with case sensitive `false`, the following is supported.
> {code}
> scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
> +++
> |named_struct(a, 1).A|named_struct(A, 1).a|
> +++
> |   1|   1|
> +++
> {code}
> This issue aims to support case-insensitive type comparisons in Set 
> operation. Currently, SET operations fail due to a case-sensitive type 
> comparison failure.
> {code}
> scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. struct<A:int> <> struct<a:int> at the first 
> column of the second table;;
> 'Union
> :- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2]
> :  +- OneRowRelation$
> +- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3]
>+- OneRowRelation$
> {code}
> Please note that this issue does not aim to change all type comparison 
> semantics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21247) Allow case-insensitive type comparisions in Set operation

2017-06-28 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-21247:
-

 Summary: Allow case-insensitive type comparisions in Set operation
 Key: SPARK-21247
 URL: https://issues.apache.org/jira/browse/SPARK-21247
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Dongjoon Hyun


Spark supports case-sensitivity in columns. Especially, for Struct types, with 
case sensitive option, the following is supported.

{code}
scala> sql("select named_struct('a', 1, 'A', 2).a").show
+--+
|named_struct(a, 1, A, 2).a|
+--+
| 1|
+--+

scala> sql("select named_struct('a', 1, 'A', 2).A").show
+--+
|named_struct(a, 1, A, 2).A|
+--+
| 2|
+--+
{code}

And vice versa, with case sensitive `false`, the following is supported.
{code}
scala> sql("select named_struct('a', 1).A, named_struct('A', 1).a").show
+++
|named_struct(a, 1).A|named_struct(A, 1).a|
+++
|   1|   1|
+++
{code}

This issue aims to support case-insensitive type comparisons in Set operation. 
Currently, SET operations fail due to a case-sensitive type comparison failure.

{code}
scala> sql("SELECT struct(1 a) UNION ALL (SELECT struct(2 A))").show
org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
with the compatible column types. struct<A:int> <> struct<a:int> at the first 
column of the second table;;
'Union
:- Project [named_struct(a, 1) AS named_struct(a, 1 AS `a`)#2]
:  +- OneRowRelation$
+- Project [named_struct(A, 2) AS named_struct(A, 2 AS `A`)#3]
   +- OneRowRelation$
{code}

Please note that this issue does not aim to change all type comparison 
semantics.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled

2017-06-28 Thread John Leach (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067257#comment-16067257
 ] 

John Leach commented on SPARK-21242:


[~mgummelt] We are using this for our service and it seems to be working when 
integrated with Calico.  Thanks for SPARK-18232, which gave us a bit of 
guidance.  

> Allow spark executors to function in mesos w/ container networking enabled
> --
>
> Key: SPARK-21242
> URL: https://issues.apache.org/jira/browse/SPARK-21242
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: Tara Gildersleeve
> Attachments: patch_1.patch
>
>
> Allow spark executors to function in mesos w/ container networking enabled



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

2017-06-28 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067167#comment-16067167
 ] 

Andrew Ray commented on SPARK-21184:


Also the lookup queries are just wrong

{code}
scala> Seq(1, 2).toDF("a").selectExpr("percentile_approx(a, 0.001)").head
res9: org.apache.spark.sql.Row = [2.0]
{code}
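
For reference, the 0.001 quantile of the values 1 and 2 should be the minimum, 1.0. A quick way to probe the same summary code path from the DataFrame API (a sketch, assuming a spark-shell session with spark.implicits._ in scope):

{code}
// Mathematically this should return Array(1.0); per this report, the
// QuantileSummaries-backed implementation may not.
Seq(1, 2).toDF("a").stat.approxQuantile("a", Array(0.001), 0.01)
{code}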


> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> 
>
> Key: SPARK-21184
> URL: https://issues.apache.org/jira/browse/SPARK-21184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but thats not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}  // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>   // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> that the invariants hold.
> 2. QuantileSummariesSuite fails with n = 1 (and other non trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21246) Unexpected Data Type conversion from LONG to BIGINT

2017-06-28 Thread Monica Raj (JIRA)
Monica Raj created SPARK-21246:
--

 Summary: Unexpected Data Type conversion from LONG to BIGINT
 Key: SPARK-21246
 URL: https://issues.apache.org/jira/browse/SPARK-21246
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
 Environment: Using Zeppelin Notebook or Spark Shell
Reporter: Monica Raj


The unexpected conversion occurred when creating a data frame out of an 
existing data collection. The following code can be run in zeppelin notebook to 
reproduce the bug:

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val schemaString = "name"
val lstVals = Seq(3)
val rowRdd = sc.parallelize(lstVals).map(x => Row( x ))
rowRdd.collect()
// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, LongType, nullable = true))
val schema = StructType(fields)
print(schema)
val peopleDF = sqlContext.createDataFrame(rowRdd, schema)
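
If the surprise is seeing {{bigint}} in SQL-facing output, that is expected naming rather than a conversion: Catalyst's LongType is printed as "long" by printSchema and rendered as the SQL type name "bigint" in SQL-style output. A quick check, assuming the same session as above:

{code}
// printSchema uses the Catalyst type name; SQL-style rendering uses "bigint".
peopleDF.printSchema()                 // |-- name: long (nullable = true)
println(peopleDF.schema.simpleString)  // struct<name:bigint>
{code}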



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21245) Resolve code duplication for classification/regression summarizers

2017-06-28 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-21245:


 Summary: Resolve code duplication for classification/regression 
summarizers
 Key: SPARK-21245
 URL: https://issues.apache.org/jira/browse/SPARK-21245
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.2.1
Reporter: Seth Hendrickson


In several places (LogReg, LinReg, SVC) in Spark ML, we collect summary 
information about training data using {{MultivariateOnlineSummarizer}} and 
{{MulticlassSummarizer}}. We have the same code appearing in several places 
> (including test suites). We can eliminate this by creating a common 
implementation somewhere.
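
For illustration, below is a minimal Scala sketch of the aggregation pattern that gets re-implemented in each of those places (features only; the real code also folds the internal label summarizer into the same pass). It is the shape of the duplication, not the eventual shared implementation:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.rdd.RDD

// One pass over the training data collecting feature statistics.
def summarizeFeatures(features: RDD[Vector]): MultivariateOnlineSummarizer =
  features.treeAggregate(new MultivariateOnlineSummarizer)(
    seqOp = (summary, v) => summary.add(v),
    combOp = (s1, s2) => s1.merge(s2))
{code}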



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13534) Implement Apache Arrow serializer for Spark DataFrame for use in DataFrame.toPandas

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066938#comment-16066938
 ] 

Apache Spark commented on SPARK-13534:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/18459

> Implement Apache Arrow serializer for Spark DataFrame for use in 
> DataFrame.toPandas
> ---
>
> Key: SPARK-13534
> URL: https://issues.apache.org/jira/browse/SPARK-13534
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Wes McKinney
>Assignee: Bryan Cutler
> Fix For: 2.3.0
>
> Attachments: benchmark.py
>
>
> The current code path for accessing Spark DataFrame data in Python using 
> PySpark passes through an inefficient serialization-deserialiation process 
> that I've examined at a high level here: 
> https://gist.github.com/wesm/0cb5531b1c2e346a0007. Currently, RDD[Row] 
> objects are being deserialized in pure Python as a list of tuples, which are 
> then converted to pandas.DataFrame using its {{from_records}} alternate 
> constructor. This also uses a large amount of memory.
> For flat (no nested types) schemas, the Apache Arrow memory layout 
> (https://github.com/apache/arrow/tree/master/format) can be deserialized to 
> {{pandas.DataFrame}} objects with comparatively small overhead compared with 
> memcpy / system memory bandwidth -- Arrow's bitmasks must be examined, 
> replacing the corresponding null values with pandas's sentinel values (None 
> or NaN as appropriate).
> I will be contributing patches to Arrow in the coming weeks for converting 
> between Arrow and pandas in the general case, so if Spark can send Arrow 
> memory to PySpark, we will hopefully be able to increase the Python data 
> access throughput by an order of magnitude or more. I propose to add an new 
> serializer for Spark DataFrame and a new method that can be invoked from 
> PySpark to request a Arrow memory-layout byte stream, prefixed by a data 
> header indicating array buffer offsets and sizes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21216) Streaming DataFrames fail to join with Hive tables

2017-06-28 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-21216.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

> Streaming DataFrames fail to join with Hive tables
> --
>
> Key: SPARK-21216
> URL: https://issues.apache.org/jira/browse/SPARK-21216
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.1
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.3.0
>
>
> The following code will throw a cryptic exception:
> {code}
> import org.apache.spark.sql.execution.streaming.MemoryStream
> import testImplicits._
> implicit val _sqlContext = spark.sqlContext
> Seq((1, "one"), (2, "two"), (4, "four")).toDF("number", 
> "word").createOrReplaceTempView("t1")
> // Make a table and ensure it will be broadcast.
> sql("""CREATE TABLE smallTable(word string, number int)
>   |ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
>   |STORED AS TEXTFILE
> """.stripMargin)
> sql(
>   """INSERT INTO smallTable
> |SELECT word, number from t1
>   """.stripMargin)
> val inputData = MemoryStream[Int]
> val joined = inputData.toDS().toDF()
>   .join(spark.table("smallTable"), $"value" === $"number")
> val sq = joined.writeStream
>   .format("memory")
>   .queryName("t2")
>   .start()
> try {
>   inputData.addData(1, 2)
>   sq.processAllAvailable()
> } finally {
>   sq.stop()
> }
> {code}
> If someone creates a HiveSession, the planner in `IncrementalExecution` 
> doesn't take into account the Hive scan strategies



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066910#comment-16066910
 ] 

Apache Spark commented on SPARK-20889:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18458

> SparkR grouped documentation for Column methods
> ---
>
> Key: SPARK-20889
> URL: https://issues.apache.org/jira/browse/SPARK-20889
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: documentation
> Fix For: 2.3.0
>
>
> Group the documentation of individual methods defined for the Column class. 
> This aims to create the following improvements:
> - Centralized documentation for easy navigation (user can view multiple 
> related methods on one single page).
> - Reduced number of items in Seealso.
> - Better examples using shared data. This avoids creating a data frame for 
> each function if they are documented separately. And more importantly, user 
> can copy and paste to run them directly!
> - Cleaner structure and much fewer Rd files (remove a large number of Rd 
> files).
> - Remove duplicated definition of param (since they share exactly the same 
> argument).
> - No need to write meaningless examples for trivial functions (because of 
> grouping).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21241:


Assignee: Apache Spark

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>Assignee: Apache Spark
>  Labels: patch
> Fix For: 2.3.0
>
>
> StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> Method which offers the possibility to turn on/off the intercept value. API 
> parity is not respected between Python and Scala.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066893#comment-16066893
 ] 

Apache Spark commented on SPARK-21241:
--

User 'SoulGuedria' has created a pull request for this issue:
https://github.com/apache/spark/pull/18457

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>  Labels: patch
> Fix For: 2.3.0
>
>
> StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> Method which offers the possibility to turn on/off the intercept value. API 
> parity is not respected between Python and Scala.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21241:


Assignee: (was: Apache Spark)

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>  Labels: patch
> Fix For: 2.3.0
>
>
> StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> Method which offers the possibility to turn on/off the intercept value. API 
> parity is not respected between Python and Scala.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21244) KMeans applied to processed text day clumps almost all documents into one cluster

2017-06-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066833#comment-16066833
 ] 

Sean Owen commented on SPARK-21244:
---

There's no detail here that suggests a Spark bug. Depending on your docs and 
your k, this might be correct.

> KMeans applied to processed text day clumps almost all documents into one 
> cluster
> -
>
> Key: SPARK-21244
> URL: https://issues.apache.org/jira/browse/SPARK-21244
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Nassir
>
> I have observed this problem for quite a while now regarding the 
> implementation of pyspark KMeans on text documents - to cluster documents 
> according to their TF-IDF vectors. The pyspark implementation - even on 
> standard datasets - clusters almost all of the documents into one cluster. 
> I implemented K-means on the same dataset with same parameters using SKlearn 
> library, and this clusters the documents very well. 
> I recommend anyone who is able to test the pyspark implementation of KMeans 
> on text documents - which obviously has a bug in it somewhere.
> (currently I am converting my spark dataframe to a pandas dataframe and running k 
> means and converting back. However, this is of course not a parallel solution 
> capable of handling huge amounts of data in future)
> Here is a link to the question i posted a while back on stackoverlfow: 
> https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20082) Incremental update of LDA model, by adding initialModel as start point

2017-06-28 Thread Mathieu DESPRIEE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066816#comment-16066816
 ] 

Mathieu DESPRIEE commented on SPARK-20082:
--

I updated the PR.

Basically, here is the approach:
- only the Online optimizer is supported; any use with the EM optimizer is rejected. If 
incremental updates are also desirable for EM, I suggest we open another JIRA for it, to 
take the time to discuss the initialization with an existing graph and new 
documents.
- I added an {{initialModel}} parameter that is used to initialize the doc 
concentration and the topics matrix (a usage sketch follows below).

 [~yuhaoyan], could you check it please ?
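
To make the intended usage concrete, here is a hypothetical Scala sketch of the proposed API. The {{setInitialModel}} parameter below is the one proposed in this JIRA/PR and does not exist in released Spark; {{firstBatch}} and {{newBatch}} are assumed DataFrames with a "features" column:

{code}
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.sql.DataFrame

def incrementalLda(firstBatch: DataFrame, newBatch: DataFrame): LDAModel = {
  val base = new LDA().setK(10).setOptimizer("online").fit(firstBatch)
  new LDA()
    .setK(10)
    .setOptimizer("online")
    .setInitialModel(base)  // proposed parameter; illustrative only
    .fit(newBatch)
}
{code}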

> Incremental update of LDA model, by adding initialModel as start point
> --
>
> Key: SPARK-20082
> URL: https://issues.apache.org/jira/browse/SPARK-20082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Mathieu DESPRIEE
>
> Some mllib models support an initialModel to start from and update it 
> incrementally with new data.
> From what I understand of OnlineLDAOptimizer, it is possible to incrementally 
> update an existing model with batches of new documents.
> I suggest to add an initialModel as a start point for LDA.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21244) KMeans applied to processed text day clumps almost all documents into one cluster

2017-06-28 Thread Nassir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nassir updated SPARK-21244:
---
Description: 
I have observed this problem for quite a while now regarding the implementation 
of pyspark KMeans on text documents - to cluster documents according to their 
TF-IDF vectors. The pyspark implementation - even on standard datasets - 
clusters almost all of the documents into one cluster. 

I implemented K-means on the same dataset with same parameters using SKlearn 
library, and this clusters the documents very well. 

I recommend anyone who is able to test the pyspark implementation of KMeans on 
text documents - which obviously has a bug in it somewhere.

(currently I am converting my spark dataframe to a pandas dataframe and running k 
means and converting back. However, this is of course not a parallel solution 
capable of handling huge amounts of data in future)

Here is a link to the question i posted a while back on stackoverlfow: 
https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one

  was:
I have observed this problem for quite a while now regarding the implementation 
of pyspark KMeans on text documents - to cluster documents according to their 
TF-IDF vectors. The pyspark implementation - even on standard datasets - 
clusters almost all of the documents into one cluster. 

I implemented K-means on the same dataset with same parameters using SKlearn 
library, and this clusters the documents very well. 

I recommend anyone who is able to test the pyspark implementation of KMeans on 
text documents - which obviously has a bug in it somewhere.

(currently I am converting my spark dataframe to a pandas dataframe and running k 
means and converting back. However, this is of course not a parallel solution 
capable of handling huge amounts of data in future)


> KMeans applied to processed text day clumps almost all documents into one 
> cluster
> -
>
> Key: SPARK-21244
> URL: https://issues.apache.org/jira/browse/SPARK-21244
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Nassir
>
> I have observed this problem for quite a while now regarding the 
> implementation of pyspark KMeans on text documents - to cluster documents 
> according to their TF-IDF vectors. The pyspark implementation - even on 
> standard datasets - clusters almost all of the documents into one cluster. 
> I implemented K-means on the same dataset with same parameters using SKlearn 
> library, and this clusters the documents very well. 
> I recommend anyone who is able to test the pyspark implementation of KMeans 
> on text documents - which obviously has a bug in it somewhere.
> (currently I am converting my spark dataframe to a pandas dataframe and running k 
> means and converting back. However, this is of course not a parallel solution 
> capable of handling huge amounts of data in future)
> Here is a link to the question i posted a while back on stackoverlfow: 
> https://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21244) KMeans applied to processed text day clumps almost all documents into one cluster

2017-06-28 Thread Nassir (JIRA)
Nassir created SPARK-21244:
--

 Summary: KMeans applied to processed text day clumps almost all 
documents into one cluster
 Key: SPARK-21244
 URL: https://issues.apache.org/jira/browse/SPARK-21244
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.1.1
Reporter: Nassir


I have observed this problem for quite a while now regarding the implementation 
of pyspark KMeans on text documents - to cluster documents according to their 
TF-IDF vectors. The pyspark implementation - even on standard datasets - 
clusters almost all of the documents into one cluster. 

I implemented K-means on the same dataset with same parameters using SKlearn 
library, and this clusters the documents very well. 

I recommend anyone who is able to test the pyspark implementation of KMeans on 
text documents - which obviously has a bug in it somewhere.

(currently I am converting my spark dataframe to a pandas dataframe and running k 
means and converting back. However, this is of course not a parallel solution 
capable of handling huge amounts of data in future)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20696) tf-idf document clustering with K-means in Apache Spark putting points into one cluster

2017-06-28 Thread Nassir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066652#comment-16066652
 ] 

Nassir commented on SPARK-20696:


Unfortunately, I have not found a place to make this known to the spark 
community yet.

My workaround has been to convert pyspark dataframe to pandas dataframe, use 
sklearn python K-Means to cluster documents (which works well), then convert 
pandas dataframe back to pyspark.

It works in my situation as the number of documents I am clustering is 
relatively small. However, I will want to process Big Data and would need a 
solution in pyspark with spark streaming in future.

Nassir

> tf-idf document clustering with K-means in Apache Spark putting points into 
> one cluster
> ---
>
> Key: SPARK-20696
> URL: https://issues.apache.org/jira/browse/SPARK-20696
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Nassir
>
> I am trying to do the classic job of clustering text documents by 
> pre-processing, generating tf-idf matrix, and then applying K-means. However, 
> testing this workflow on the classic 20NewsGroup dataset results in most 
> documents being clustered into one cluster. (I have initially tried to 
> cluster all documents from 6 of the 20 groups - so expecting to cluster into 
> 6 clusters).
> I am implementing this in Apache Spark as my purpose is to utilise this 
> technique on millions of documents. Here is the code written in Pyspark on 
> Databricks:
> #declare path to folder containing 6 of 20 news group categories
> path = "/mnt/%s/20news-bydate.tar/20new-bydate-train-lessFolders/*/*" % 
> MOUNT_NAME
> #read all the text files from the 6 folders. Each entity is an entire 
> document. 
> text_files = sc.wholeTextFiles(path).cache()
> #convert rdd to dataframe
> df = text_files.toDF(["filePath", "document"]).cache()
> from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer 
> #tokenize the document text
> tokenizer = Tokenizer(inputCol="document", outputCol="tokens")
> tokenized = tokenizer.transform(df).cache()
> from pyspark.ml.feature import StopWordsRemover
> remover = StopWordsRemover(inputCol="tokens", 
> outputCol="stopWordsRemovedTokens")
> stopWordsRemoved_df = remover.transform(tokenized).cache()
> hashingTF = HashingTF (inputCol="stopWordsRemovedTokens", 
> outputCol="rawFeatures", numFeatures=20)
> tfVectors = hashingTF.transform(stopWordsRemoved_df).cache()
> idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)
> idfModel = idf.fit(tfVectors)
> tfIdfVectors = idfModel.transform(tfVectors).cache()
> #note that I have also tried to use normalized data, but get the same result
> from pyspark.ml.feature import Normalizer
> from pyspark.ml.linalg import Vectors
> normalizer = Normalizer(inputCol="features", outputCol="normFeatures")
> l2NormData = normalizer.transform(tfIdfVectors)
> from pyspark.ml.clustering import KMeans
> # Trains a KMeans model.
> kmeans = KMeans().setK(6).setMaxIter(20)
> km_model = kmeans.fit(l2NormData)
> clustersTable = km_model.transform(l2NormData)
> [output showing most documents get clustered into cluster 0][1]
> ID number_of_documents_in_cluster 
> 0 3024 
> 3 5 
> 1 3 
> 5 2
> 2 2 
> 4 1
> As you can see most of my data points get clustered into cluster 0, and I 
> cannot figure out what I am doing wrong as all the tutorials and code I have 
> come across online point to using this method.
> In addition I have also tried normalizing the tf-idf matrix before K-means 
> but that also produces the same result. I know cosine distance is a better 
> measure to use, but I expected using standard K-means in Apache Spark would 
> provide meaningful results.
> Can anyone help with regards to whether I have a bug in my code, or if 
> something is missing in my data clustering pipeline?
> (Question also asked in Stackoverflow before: 
> http://stackoverflow.com/questions/43863373/tf-idf-document-clustering-with-k-means-in-apache-spark-putting-points-into-one)
> Thank you in advance!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Soulaimane GUEDRIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soulaimane GUEDRIA updated SPARK-21241:
---
Description: StreamingLinearRegressionWithSGD class in PySpark is missing 
the setIntercept Method which offers the possibility to turn on/off the 
intercept value. API parity is not respected between Python and Scala.  (was: 
StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
Method which offers the possibility to turn on/off the intercept value. API 
parity is not achieved with Scala API.)

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>  Labels: patch
> Fix For: 2.3.0
>
>
> StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> Method which offers the possibility to turn on/off the intercept value. API 
> parity is not respected between Python and Scala.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21241) Add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Soulaimane GUEDRIA (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Soulaimane GUEDRIA updated SPARK-21241:
---
Summary: Add intercept to StreamingLinearRegressionWithSGD  (was: Can't add 
intercept to StreamingLinearRegressionWithSGD)

> Add intercept to StreamingLinearRegressionWithSGD
> -
>
> Key: SPARK-21241
> URL: https://issues.apache.org/jira/browse/SPARK-21241
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.3.0
>Reporter: Soulaimane GUEDRIA
>  Labels: patch
> Fix For: 2.3.0
>
>
> StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
> Method which offers the possibility to turn on/off the intercept value. API 
> parity is not achieved with Scala API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled

2017-06-28 Thread Tara Gildersleeve (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tara Gildersleeve updated SPARK-21242:
--
Priority: Major  (was: Minor)

> Allow spark executors to function in mesos w/ container networking enabled
> --
>
> Key: SPARK-21242
> URL: https://issues.apache.org/jira/browse/SPARK-21242
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: Tara Gildersleeve
> Attachments: patch_1.patch
>
>
> Allow spark executors to function in mesos w/ container networking enabled



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066517#comment-16066517
 ] 

Cody Koeninger commented on SPARK-21233:


You already have the choice of where you want to store offsets.

http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#storing-offsets
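
For completeness, a condensed Scala sketch of the Kafka-based option from that page (assuming the kafka-0-10 direct stream; {{stream}} is the InputDStream returned by KafkaUtils.createDirectStream):

{code}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

def processAndCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process rdd ...
    // Commit back to Kafka itself (asynchronously) instead of ZooKeeper.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }
}
{code}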

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*. When there 
> are a lot of streaming programs running in the cluster, the ZooKeeper cluster's 
> load is very high. ZooKeeper may not be very suitable for saving offsets 
> periodically.
> This issue proposes supporting a pluggable offset storage so that offsets do 
> not have to be saved in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21243) Limit the number of maps in a single shuffle fetch

2017-06-28 Thread Dhruve Ashar (JIRA)
Dhruve Ashar created SPARK-21243:


 Summary: Limit the number of maps in a single shuffle fetch
 Key: SPARK-21243
 URL: https://issues.apache.org/jira/browse/SPARK-21243
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1, 2.1.0
Reporter: Dhruve Ashar
Priority: Minor


Right now spark can limit the # of parallel fetches and also limits the amount 
of data in one fetch, but one fetch to a host could be for 100's of blocks. In 
one instance we saw 450+ blocks. When you have 100's of those and 1000's of 
reducers fetching, that becomes a lot of metadata and can run the Node Manager 
out of memory. We should add a config to limit the # of maps per fetch to 
reduce the load on the NM.
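
As a sketch of what the knob could look like next to the existing ones (the proposed property name and value below are illustrative, not final; the two existing settings are real):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.reducer.maxSizeInFlight", "48m")              // existing: caps data per fetch
  .set("spark.reducer.maxReqsInFlight", "64")               // existing: caps parallel fetch requests
  .set("spark.reducer.maxBlocksInFlightPerAddress", "200")  // proposed: caps map blocks fetched per host
{code}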




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled

2017-06-28 Thread Tara Gildersleeve (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tara Gildersleeve updated SPARK-21242:
--
Attachment: patch_1.patch

> Allow spark executors to function in mesos w/ container networking enabled
> --
>
> Key: SPARK-21242
> URL: https://issues.apache.org/jira/browse/SPARK-21242
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 2.1.1
>Reporter: Tara Gildersleeve
>Priority: Minor
> Attachments: patch_1.patch
>
>
> Allow spark executors to function in mesos w/ container networking enabled



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21242) Allow spark executors to function in mesos w/ container networking enabled

2017-06-28 Thread Tara Gildersleeve (JIRA)
Tara Gildersleeve created SPARK-21242:
-

 Summary: Allow spark executors to function in mesos w/ container 
networking enabled
 Key: SPARK-21242
 URL: https://issues.apache.org/jira/browse/SPARK-21242
 Project: Spark
  Issue Type: New Feature
  Components: Mesos
Affects Versions: 2.1.1
Reporter: Tara Gildersleeve
Priority: Minor


Allow spark executors to function in mesos w/ container networking enabled



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21241) Can't add intercept to StreamingLinearRegressionWithSGD

2017-06-28 Thread Soulaimane GUEDRIA (JIRA)
Soulaimane GUEDRIA created SPARK-21241:
--

 Summary: Can't add intercept to StreamingLinearRegressionWithSGD
 Key: SPARK-21241
 URL: https://issues.apache.org/jira/browse/SPARK-21241
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.3.0
Reporter: Soulaimane GUEDRIA
 Fix For: 2.3.0


StreamingLinearRegressionWithSGD class in PySpark is missing the setIntercept 
Method which offers the possibility to turn on/off the intercept value. API 
parity is not achieved with Scala API.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066471#comment-16066471
 ] 

Sean Owen commented on SPARK-21227:
---

Yes, I think this is ultimately related to two different locales disagreeing on 
what the lower- or upper-case of the string is. This Turkish character is the 
most common trigger. We addressed this in Spark a while ago. However we left 
most any string that is generated by a user application untouched, on the 
assumption they may want locale-specific behavior (and to lessen the scope of 
the change).

This behavior isn't expected, even if it's a somewhat contrived example you can 
work around. Something is doing a locale-insensitive toLowerCase, which maybe 
shouldn't.
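
A small Scala illustration of the collision (assuming the ambiguity ultimately comes from per-character case folding, which maps both the dotless ı and ASCII i to 'I'):

{code}
// U+0131 (LATIN SMALL LETTER DOTLESS I) upper-cases to plain 'I',
// so case-insensitive comparison treats the two field names as equal.
println('\u0131'.toUpper)                           // I
println("cıty_name".equalsIgnoreCase("city_name"))  // true
{code}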

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the step to reproduce the issue I am facing.
> First I create a json with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ASCII, while the second contains a unicode character 
> (ı, i without a dot).
> When I try to select from the dataframe I have an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>       /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
> __getitem__
> jc = self._jdf.apply(item)
>   File "/u

[jira] [Assigned] (SPARK-21228) InSet incorrect handling of structs

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21228:


Assignee: Apache Spark

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>Assignee: Apache Spark
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> A solution could be either to do a safe<->unsafe conversion in InSet or not 
> trigger the InSet optimization at all in this case.
> Need to investigate if In.eval is affected.
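A minimal standalone sketch of the representation mismatch (using catalyst internals
directly; this illustrates the reported behaviour and one possible direction, namely
normalizing everything to the unsafe format, and is not the actual fix):

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object InSetRowMismatchSketch {
  def main(args: Array[String]): Unit = {
    val schema   = StructType(Seq(StructField("a", LongType), StructField("b", LongType)))
    val generic  = new GenericInternalRow(Array[Any](1L, 1L))
    val toUnsafe = UnsafeProjection.create(schema)
    val unsafe   = toUnsafe(generic).copy() // same logical value, different representation

    // The shape of the failure: a set of GenericInternalRows never "contains"
    // the equivalent UnsafeRow, because the two classes do not compare equal.
    val hset = Set[Any](generic)
    println(hset.contains(unsafe)) // false

    // Normalizing both sides to the unsafe representation makes membership work again.
    val normalized = hset.map { case r: InternalRow => toUnsafe(r).copy() }
    println(normalized.contains(unsafe)) // true
  }
}
{code}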



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21228) InSet incorrect handling of structs

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21228:


Assignee: (was: Apache Spark)

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while the child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval), which will always return false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> A solution could be either to do a safe<->unsafe conversion in InSet or not 
> trigger the InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066457#comment-16066457
 ] 

Apache Spark commented on SPARK-21228:
--

User 'bogdanrdc' has created a pull request for this issue:
https://github.com/apache/spark/pull/18455

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while the child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval), which will always return false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> A solution could be either to do a safe<->unsafe conversion in InSet or not 
> trigger the InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-28 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066423#comment-16066423
 ] 

Hyukjin Kwon commented on SPARK-21227:
--

I took a quick look and the cause seems to be here - 
https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/package.scala#L36
 . It appears that this case-insensitive comparison treats the two strings as equal 
because of the Turkish dotless i. I was thinking of doing a toLowerCase comparison 
instead, but I guess we should investigate whether such a change introduces other 
corner-case behaviour changes.

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the steps to reproduce the issue I am facing.
> First I create a JSON with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ASCII, while the second contains a Unicode character (ı, 
> i without a dot).
> When I try to select from the dataframe I get an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
> __getitem__
> jc = self._jdf.apply(item)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql

[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-28 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066363#comment-16066363
 ] 

Michael Styles commented on SPARK-17091:


In Parquet 1.7, there was a bug involving corrupt statistics on binary columns 
(https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier 
versions of Spark from generating Parquet filters on any string columns. Spark 
2.1 has moved up to Parquet 1.8.2, so this issue no longer exists.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.
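As a rough illustration of the proposed rewrite (a sketch against Spark's public
org.apache.spark.sql.sources filter API, not the patch in the linked pull request),
an IN filter can be folded into a chain of OR'ed equality filters:

{code:scala}
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

object InToOrRewrite {
  // In("x", [v1, v2, v3]) becomes Or(Or(EqualTo(x, v1), EqualTo(x, v2)), EqualTo(x, v3)),
  // which the existing Parquet filter pushdown already knows how to translate.
  def rewrite(filter: Filter): Filter = filter match {
    case In(attribute, values) if values.nonEmpty =>
      values.map(v => EqualTo(attribute, v): Filter).reduce(Or(_, _))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    println(rewrite(In("city_name", Array("paris", "rome", "berlin"))))
  }
}
{code}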



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21240:


Assignee: Apache Spark

> Fix code style for constructing and stopping a SparkContext in UT
> -
>
> Key: SPARK-21240
> URL: https://issues.apache.org/jira/browse/SPARK-21240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Assignee: Apache Spark
>Priority: Trivial
>
> Related to SPARK-20985.
> Fix code style for constructing and stopping a SparkContext. Ensure the 
> context is stopped so that other tests do not complain that only one 
> SparkContext can exist.
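
As a minimal sketch of the pattern this issue refers to (assuming ScalaTest's FunSuite;
this is not code from the actual pull request): construct the SparkContext inside the
test and stop it in a finally block, so a failing assertion cannot leak the context
into later suites.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.FunSuite

class AlwaysStopContextSuite extends FunSuite {
  test("runs a job and always stops the SparkContext") {
    val sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local[2]"))
    try {
      // The real test body goes here; a failure must not skip sc.stop().
      assert(sc.parallelize(1 to 10).count() === 10)
    } finally {
      sc.stop()
    }
  }
}
{code}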



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066360#comment-16066360
 ] 

Apache Spark commented on SPARK-21240:
--

User 'jinxing64' has created a pull request for this issue:
https://github.com/apache/spark/pull/18454

> Fix code style for constructing and stopping a SparkContext in UT
> -
>
> Key: SPARK-21240
> URL: https://issues.apache.org/jira/browse/SPARK-21240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Priority: Trivial
>
> Related to SPARK-20985.
> Fix code style for constructing and stopping a SparkContext. Ensure the 
> context is stopped so that other tests do not complain that only one 
> SparkContext can exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21240:


Assignee: (was: Apache Spark)

> Fix code style for constructing and stopping a SparkContext in UT
> -
>
> Key: SPARK-21240
> URL: https://issues.apache.org/jira/browse/SPARK-21240
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: jin xing
>Priority: Trivial
>
> Related to SPARK-20985.
> Fix code style for constructing and stopping a SparkContext. Ensure the 
> context is stopped so that other tests do not complain that only one 
> SparkContext can exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21240) Fix code style for constructing and stopping a SparkContext in UT

2017-06-28 Thread jin xing (JIRA)
jin xing created SPARK-21240:


 Summary: Fix code style for constructing and stopping a 
SparkContext in UT
 Key: SPARK-21240
 URL: https://issues.apache.org/jira/browse/SPARK-21240
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: jin xing
Priority: Trivial


Related to SPARK-20985.
Fix code style for constructing and stopping a SparkContext. Ensure the context is 
stopped so that other tests do not complain that only one SparkContext can exist.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066319#comment-16066319
 ] 

zenglinxi edited comment on SPARK-21223 at 6/28/17 10:42 AM:
-

[~sowen] OK, I will check SPARK-21078 first.


was (Author: gostop_zlx):
OK, I will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in class 
> FsHistoryProvider to store the mapping from event log path to attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with new ones, multiple threads may update fileToAppInfo at the 
> same time, which may cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>tasks += replayExecutor.submit(new Runnable {
> override def run(): Unit = mergeApplicationListing(file)
>  })
>  }
> {code}
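
One possible direction (a hedged sketch, not necessarily the fix that will be adopted
upstream) is to back fileToAppInfo with a concurrent map, so that the replay tasks
submitted to replayExecutor can update it without external locking. The AttemptInfo
type below is a stand-in for the real attempt-info class kept by FsHistoryProvider.

{code:scala}
import java.util.concurrent.ConcurrentHashMap

object FileToAppInfoSketch {
  // Placeholder for the real attempt-info value type.
  final case class AttemptInfo(appId: String, lastUpdated: Long)

  // ConcurrentHashMap allows the pool of replay tasks to call put concurrently.
  private val fileToAppInfo = new ConcurrentHashMap[String, AttemptInfo]()

  def record(logPath: String, info: AttemptInfo): Unit = {
    fileToAppInfo.put(logPath, info)
  }

  def main(args: Array[String]): Unit = {
    record("/eventlogs/app-1", AttemptInfo("app-1", System.currentTimeMillis()))
    println(fileToAppInfo.size()) // 1
  }
}
{code}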



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-28 Thread zenglinxi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066319#comment-16066319
 ] 

zenglinxi commented on SPARK-21223:
---

OK, I will check SPARK-21078 first.

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in class 
> FsHistoryProvider to store the mapping from event log path to attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with new ones, multiple threads may update fileToAppInfo at the 
> same time, which may cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>tasks += replayExecutor.submit(new Runnable {
> override def run(): Unit = mergeApplicationListing(file)
>  })
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-06-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19852:
---

Assignee: Vincent

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Vincent
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066289#comment-16066289
 ] 

Apache Spark commented on SPARK-19852:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18453

> StringIndexer.setHandleInvalid should have another option 'new': Python API 
> and docs
> 
>
> Key: SPARK-19852
> URL: https://issues.apache.org/jira/browse/SPARK-19852
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Update Python API for StringIndexer so setHandleInvalid doc is correct.  This 
> will probably require:
> * putting HandleInvalid within StringIndexer to update its built-in doc (See 
> Bucketizer for an example.)
> * updating API docs and maybe the guide



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21239) Support WAL recover in windows

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21239:


Assignee: Apache Spark

> Support WAL recover in windows
> --
>
> Key: SPARK-21239
> URL: https://issues.apache.org/jira/browse/SPARK-21239
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Windows
>Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0
>Reporter: Yun Tang
>Assignee: Apache Spark
> Fix For: 2.1.2, 2.2.1
>
>
> When the driver fails over, it reads the WAL from HDFS by calling 
> WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
> dummy local path to satisfy the method's parameter requirements, and on Windows 
> that path contains a colon, which is not valid for Hadoop. I removed the 
> potential drive letter and colon.
> I found one email from the spark-user list that talked about this bug 
> (https://www.mail-archive.com/user@spark.apache.org/msg55030.html)
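
A hedged sketch of the kind of sanitisation described above (an assumption about the
approach, not the actual change in the linked pull request): strip a leading Windows
drive letter and colon from the dummy local path before it reaches Hadoop, which
rejects ':' in path segments.

{code:scala}
object StripDriveLetter {
  private val DrivePrefix = "^[a-zA-Z]:".r

  // "C:\tmp\wal\dummy" -> "/tmp/wal/dummy"; non-Windows paths pass through unchanged.
  def toHadoopFriendly(localPath: String): String =
    DrivePrefix.replaceFirstIn(localPath.replace('\\', '/'), "")

  def main(args: Array[String]): Unit = {
    println(toHadoopFriendly("C:\\tmp\\spark\\wal\\dummy")) // /tmp/spark/wal/dummy
    println(toHadoopFriendly("/tmp/spark/wal/dummy"))       // /tmp/spark/wal/dummy
  }
}
{code}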



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21239) Support WAL recover in windows

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066282#comment-16066282
 ] 

Apache Spark commented on SPARK-21239:
--

User 'Myasuka' has created a pull request for this issue:
https://github.com/apache/spark/pull/18452

> Support WAL recover in windows
> --
>
> Key: SPARK-21239
> URL: https://issues.apache.org/jira/browse/SPARK-21239
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Windows
>Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0
>Reporter: Yun Tang
> Fix For: 2.1.2, 2.2.1
>
>
> When the driver fails over, it reads the WAL from HDFS by calling 
> WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
> dummy local path to satisfy the method's parameter requirements, and on Windows 
> that path contains a colon, which is not valid for Hadoop. I removed the 
> potential drive letter and colon.
> I found one email from the spark-user list that talked about this bug 
> (https://www.mail-archive.com/user@spark.apache.org/msg55030.html)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21239) Support WAL recover in windows

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21239:


Assignee: (was: Apache Spark)

> Support WAL recover in windows
> --
>
> Key: SPARK-21239
> URL: https://issues.apache.org/jira/browse/SPARK-21239
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Windows
>Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0
>Reporter: Yun Tang
> Fix For: 2.1.2, 2.2.1
>
>
> When the driver fails over, it reads the WAL from HDFS by calling 
> WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
> dummy local path to satisfy the method's parameter requirements, and on Windows 
> that path contains a colon, which is not valid for Hadoop. I removed the 
> potential drive letter and colon.
> I found one email from the spark-user list that talked about this bug 
> (https://www.mail-archive.com/user@spark.apache.org/msg55030.html)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21239) Support WAL recover in windows

2017-06-28 Thread Yun Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yun Tang updated SPARK-21239:
-
Description: 
When the driver fails over, it reads the WAL from HDFS by calling 
WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
dummy local path to satisfy the method's parameter requirements, and on Windows 
that path contains a colon, which is not valid for Hadoop. I removed the 
potential drive letter and colon.

I found one email from the spark-user list that talked about this bug 
(https://www.mail-archive.com/user@spark.apache.org/msg55030.html)

  was:
When the driver fails over, it reads the WAL from HDFS by calling 
WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
dummy local path to satisfy the method's parameter requirements, and on Windows 
that path contains a colon, which is not valid for Hadoop. I removed the 
potential drive letter and colon.

I found one email from the spark-user list that talked about [this 
bug](https://www.mail-archive.com/user@spark.apache.org/msg55030.html)


> Support WAL recover in windows
> --
>
> Key: SPARK-21239
> URL: https://issues.apache.org/jira/browse/SPARK-21239
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Windows
>Affects Versions: 1.6.3, 2.1.0, 2.1.1, 2.2.0
>Reporter: Yun Tang
> Fix For: 2.1.2, 2.2.1
>
>
> When the driver fails over, it reads the WAL from HDFS by calling 
> WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
> dummy local path to satisfy the method's parameter requirements, and on Windows 
> that path contains a colon, which is not valid for Hadoop. I removed the 
> potential drive letter and colon.
> I found one email from the spark-user list that talked about this bug 
> (https://www.mail-archive.com/user@spark.apache.org/msg55030.html)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21239) Support WAL recover in windows

2017-06-28 Thread Yun Tang (JIRA)
Yun Tang created SPARK-21239:


 Summary: Support WAL recover in windows
 Key: SPARK-21239
 URL: https://issues.apache.org/jira/browse/SPARK-21239
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Windows
Affects Versions: 2.1.1, 2.1.0, 1.6.3, 2.2.0
Reporter: Yun Tang
 Fix For: 2.1.2, 2.2.1


When the driver fails over, it reads the WAL from HDFS by calling 
WriteAheadLogBackedBlockRDD.getBlockFromWriteAheadLog(); however, it needs a 
dummy local path to satisfy the method's parameter requirements, and on Windows 
that path contains a colon, which is not valid for Hadoop. I removed the 
potential drive letter and colon.

I found one email from the spark-user list that talked about [this 
bug](https://www.mail-archive.com/user@spark.apache.org/msg55030.html)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17091:


Assignee: (was: Apache Spark)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066222#comment-16066222
 ] 

Apache Spark commented on SPARK-17091:
--

User 'ptkool' has created a pull request for this issue:
https://github.com/apache/spark/pull/18424

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17091:


Assignee: Apache Spark

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>Assignee: Apache Spark
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:14 AM:


Hi [Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], 
in Kafka 0.8 it uses zkClient to commit offsets into the ZooKeeper cluster. 
It seems Kafka 0.10+ can save offsets in a topic. I wish to add some config 
items to control the storage instance and other parameters.


was (Author: darion):
[Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], in 
Kafka 0.8 it uses zkClient to commit offsets into the ZooKeeper cluster. It 
seems Kafka 0.10+ can save offsets in a topic. I wish to add some config items 
to control the storage instance and other parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. ZooKeeper may not be well suited to saving 
> offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM:


[Sean|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=srowen], in 
Kafka 0.8 it uses zkClient to commit offsets into the ZooKeeper cluster. It 
seems Kafka 0.10+ can save offsets in a topic. I wish to add some config items 
to control the storage instance and other parameters.


was (Author: darion):
[Sean|sro...@gmail.com], in Kafka 0.8 it uses zkClient to commit offsets into 
the ZooKeeper cluster. It seems Kafka 0.10+ can save offsets in a topic. I 
wish to add some config items to control the storage instance and other 
parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. ZooKeeper may not be well suited to saving 
> offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet edited comment on SPARK-21233 at 6/28/17 9:13 AM:


[Sean|sro...@gmail.com], in Kafka 0.8 it uses zkClient to commit offsets into 
the ZooKeeper cluster. It seems Kafka 0.10+ can save offsets in a topic. I 
wish to add some config items to control the storage instance and other 
parameters.


was (Author: darion):
[Sean Owen|sro...@gmail.com], in Kafka 0.8 it uses zkClient to commit offsets 
into the ZooKeeper cluster. It seems Kafka 0.10+ can save offsets in a topic. 
I wish to add some config items to control the storage instance and other 
parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. ZooKeeper may not be well suited to saving 
> offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21137) Spark reads many small files slowly off local filesystem

2017-06-28 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-21137:
---
Summary: Spark reads many small files slowly off local filesystem  (was: 
Spark reads many small files slowly)

> Spark reads many small files slowly off local filesystem
> 
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!
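
To make the single-threaded-listing point concrete, here is a hedged sketch of listing
many directories in parallel against the Hadoop FileSystem API, purely as a point of
comparison (it is neither the reporter's code nor Spark's internal listing code, and
the paths are placeholders):

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

object ParallelListing {
  // Lists each directory on its own thread instead of walking them one by one.
  def listFiles(dirs: Seq[Path], parallelism: Int = 8): Seq[FileStatus] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val conf = new Configuration()
    try {
      val futures = dirs.map(dir => Future(dir.getFileSystem(conf).listStatus(dir).toSeq))
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally {
      pool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = {
    val dirs = Seq(new Path("file:///tmp/enron/part-0"), new Path("file:///tmp/enron/part-1"))
    println(listFiles(dirs).length)
  }
}
{code}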



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread darion yaphet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066179#comment-16066179
 ] 

darion yaphet commented on SPARK-21233:
---

[Sean Owen|sro...@gmail.com], in Kafka 0.8 it uses zkClient to commit offsets 
into the ZooKeeper cluster. It seems Kafka 0.10+ can save offsets in a topic. 
I wish to add some config items to control the storage instance and other 
parameters.

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. ZooKeeper may not be well suited to saving 
> offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18004) DataFrame filter Predicate push-down fails for Oracle Timestamp type columns

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066162#comment-16066162
 ] 

Apache Spark commented on SPARK-18004:
--

User 'SharpRay' has created a pull request for this issue:
https://github.com/apache/spark/pull/18451

> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns
> 
>
> Key: SPARK-18004
> URL: https://issues.apache.org/jira/browse/SPARK-18004
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Suhas Nalapure
>Priority: Critical
>
> DataFrame filter Predicate push-down fails for Oracle Timestamp type columns 
> with Exception java.sql.SQLDataException: ORA-01861: literal does not match 
> format string:
> Java source code (this code works fine for mysql & mssql databases) :
> {noformat}
> //DataFrame df = create a DataFrame over an Oracle table
> df = df.filter(df.col("TS").lt(new 
> java.sql.Timestamp(System.currentTimeMillis(;
>   df.explain();
>   df.show();
> {noformat}
> Log statements with the Exception:
> {noformat}
> Schema: root
>  |-- ID: string (nullable = false)
>  |-- TS: timestamp (nullable = true)
>  |-- DEVICE_ID: string (nullable = true)
>  |-- REPLACEMENT: string (nullable = true)
> {noformat}
> {noformat}
> == Physical Plan ==
> Filter (TS#1 < 1476861841934000)
> +- Scan 
> JDBCRelation(jdbc:oracle:thin:@10.0.0.111:1521:orcl,ORATABLE,[Lorg.apache.spark.Partition;@78c74647,{user=user,
>  password=pwd, url=jdbc:oracle:thin:@10.0.0.111:1521:orcl, dbtable=ORATABLE, 
> driver=oracle.jdbc.driver.OracleDriver})[ID#0,TS#1,DEVICE_ID#2,REPLACEMENT#3] 
> PushedFilters: [LessThan(TS,2016-10-19 12:54:01.934)]
> 2016-10-19 12:54:04,268 ERROR [Executor task launch worker-0] 
> org.apache.spark.executor.Executor
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: ORA-01861: literal does not match format string
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461)
>   at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402)
>   at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1065)
>   at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:681)
>   at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:256)
>   at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:577)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:239)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.doOall8(T4CPreparedStatement.java:75)
>   at 
> oracle.jdbc.driver.T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:1043)
>   at 
> oracle.jdbc.driver.OracleStatement.executeMaybeDescribe(OracleStatement.java:)
>   at 
> oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1353)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeInternal(OraclePreparedStatement.java:4485)
>   at 
> oracle.jdbc.driver.OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:4566)
>   at 
> oracle.jdbc.driver.OraclePreparedStatementWrapper.executeQuery(OraclePreparedStatementWrapper.java:5251)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:383)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:359)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-28 Thread Seydou Dia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066151#comment-16066151
 ] 

Seydou Dia commented on SPARK-21227:


Hi [~hyukjin.kwon],

Thanks for confirming this. I happen to be a Python dev and a junior Scala 
developer. I would love to help on this; with some guidance I think I can help.

Best,
Seydou

> Unicode in Json field causes AnalysisException when selecting from Dataframe
> 
>
> Key: SPARK-21227
> URL: https://issues.apache.org/jira/browse/SPARK-21227
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0
>Reporter: Seydou Dia
>
> Hi,
> please find below the steps to reproduce the issue I am facing.
> First I create a JSON with 2 fields:
> * city_name
> * cıty_name
> The first one is valid ASCII, while the second contains a Unicode character (ı, 
> i without a dot).
> When I try to select from the dataframe I get an  {noformat} 
> AnalysisException {noformat}.
> {code:python}
> $ pyspark
> Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
> Attempting port 4041.
> 17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
> Attempting port 4042.
> 17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
> is set, falling back to uploading libraries under SPARK_HOME.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.1.0
>   /_/
> Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
> SparkSession available as 'spark'.
> >>> sc=spark.sparkContext
> >>> js = ['{"city_name": "paris"}'
> ... , '{"city_name": "rome"}'
> ... , '{"city_name": "berlin"}'
> ... , '{"cıty_name": "new-york"}'
> ... , '{"cıty_name": "toronto"}'
> ... , '{"cıty_name": "chicago"}'
> ... , '{"cıty_name": "dubai"}']
> >>> myRDD = sc.parallelize(js)
> >>> myDF = spark.read.json(myRDD)
> >>> myDF.printSchema()
> >>>   
> root
>  |-- city_name: string (nullable = true)
>  |-- cıty_name: string (nullable = true)
> >>> myDF.select(myDF['city_name'])
> Traceback (most recent call last):
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
> 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
> : org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
>   at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
>   at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
>   at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
> __getitem__
> jc = self._jdf.apply(item)
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
> line 1133, in __call__
>   File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
> raise AnalysisException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, 
> could be: city_name#29, city_name#30.;"
> {code}



--
This message was sent by Atlassian JIRA

[jira] [Assigned] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21238:


Assignee: Wenchen Fan  (was: Apache Spark)

> allow nested SQL execution
> --
>
> Key: SPARK-21238
> URL: https://issues.apache.org/jira/browse/SPARK-21238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066130#comment-16066130
 ] 

Apache Spark commented on SPARK-21238:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18450

> allow nested SQL execution
> --
>
> Key: SPARK-21238
> URL: https://issues.apache.org/jira/browse/SPARK-21238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21238:


Assignee: Apache Spark  (was: Wenchen Fan)

> allow nested SQL execution
> --
>
> Key: SPARK-21238
> URL: https://issues.apache.org/jira/browse/SPARK-21238
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21238) allow nested SQL execution

2017-06-28 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21238:
---

 Summary: allow nested SQL execution
 Key: SPARK-21238
 URL: https://issues.apache.org/jira/browse/SPARK-21238
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21237:


Assignee: Apache Spark

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066097#comment-16066097
 ] 

Apache Spark commented on SPARK-21237:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18449

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21237:


Assignee: (was: Apache Spark)

> Invalidate stats once table data is changed
> ---
>
> Key: SPARK-21237
> URL: https://issues.apache.org/jira/browse/SPARK-21237
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21237) Invalidate stats once table data is changed

2017-06-28 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-21237:


 Summary: Invalidate stats once table data is changed
 Key: SPARK-21237
 URL: https://issues.apache.org/jira/browse/SPARK-21237
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21233) Support pluggable offset storage

2017-06-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066088#comment-16066088
 ] 

Sean Owen commented on SPARK-21233:
---

Where would you put it instead? Kafka already provides for storing offsets in 
Kafka.
This sounds like a big change and I don't see detail or design here.
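
For reference, a hedged sketch of what the existing options already look like with the
Kafka 0.10 direct stream (topic name, group id and bootstrap servers below are
placeholders): the offset ranges of each batch can be committed back to Kafka itself
via CanCommitOffsets, or persisted in a store of your own choosing.

{code:scala}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object OffsetCommitSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("offsets").setMaster("local[2]"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example-group",
      "enable.auto.commit" -> (false: java.lang.Boolean))

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("example-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // Either persist offsetRanges in a store of your choice here, or hand them
      // back to Kafka's own offset storage:
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}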

> Support pluggable offset storage
> 
>
> Key: SPARK-21233
> URL: https://issues.apache.org/jira/browse/SPARK-21233
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.0.2, 2.1.1
>Reporter: darion yaphet
>
> Currently we use *ZooKeeper* to save the *Kafka commit offsets*; when there 
> are a lot of streaming programs running in the cluster, the load on the 
> ZooKeeper cluster is very high. ZooKeeper may not be well suited to saving 
> offsets periodically.
> This issue proposes supporting a pluggable offset storage to avoid saving 
> them in ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21234) When the function returns Option[Iterator[_]] is None,then get on None will cause java.util.NoSuchElementException: None.get

2017-06-28 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21234.
---
Resolution: Invalid

Not if it's known the value exists. I don't see you've established any actual 
problem here. Please read http://spark.apache.org/contributing.html

> When the function returns Option[Iterator[_]] is None,then get on None will 
> cause java.util.NoSuchElementException: None.get
> 
>
> Key: SPARK-21234
> URL: https://issues.apache.org/jira/browse/SPARK-21234
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.1.1
>Reporter: wangjiaochun
>
> Class BlockManager {
> ...
> def getLocalValues(blockId: BlockId): Option[BlockResult] ={
> ...
> memoryStore.getValues(blockId).get
> ...
> }
> ..
> }
> The getValues call above can produce three outcomes: None, an 
> IllegalArgumentException, or a normal value. If it returns None, calling .get 
> throws java.util.NoSuchElementException: None.get, so I think this is a potential bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20889) SparkR grouped documentation for Column methods

2017-06-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066033#comment-16066033
 ] 

Apache Spark commented on SPARK-20889:
--

User 'actuaryzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/18448

> SparkR grouped documentation for Column methods
> ---
>
> Key: SPARK-20889
> URL: https://issues.apache.org/jira/browse/SPARK-20889
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>  Labels: documentation
> Fix For: 2.3.0
>
>
> Group the documentation of individual methods defined for the Column class. 
> This aims to create the following improvements:
> - Centralized documentation for easy navigation (user can view multiple 
> related methods on one single page).
> - Reduced number of items in Seealso.
> - Better examples using shared data. This avoids creating a data frame for 
> each function if they are documented separately. And more importantly, users 
> can copy and paste to run them directly!
> - Cleaner structure and far fewer Rd files (removes a large number of Rd 
> files).
> - Remove duplicated definition of param (since they share exactly the same 
> argument).
> - No need to write meaningless examples for trivial functions (because of 
> grouping).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org