[jira] [Resolved] (SPARK-10483) spark-submit can not support symbol link

2015-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10483.
---
Resolution: Duplicate

Please have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. 
You should search JIRA first too, since there are a few issues about this. I 
also don't believe it is a bug.

> spark-submit can not support symbol link
> 
>
> Key: SPARK-10483
> URL: https://issues.apache.org/jira/browse/SPARK-10483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1
> Environment: Red Hat Enterprise Linux Server release 6.4 (Santiago)
>Reporter: xuqing
>
> Create a symbolic link for spark-submit:
> {quote}
> [root@xqwin03 bin]# ll spark-submit
> lrwxrwxrwx 1 root root 47 Sep  8 02:49 spark-submit -> 
> /opt/spark-1.3.1-bin-hadoop2.4/bin/spark-submit
> {quote}
> Running spark-submit then fails with the following error:
> {color:red}
> /usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or 
> directory
> {color}
> The reason is that
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
> cannot handle symbolic links.
> Changing it to
> {color:red}
> SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
> {color}
> fixes the problem.






[jira] [Updated] (SPARK-10483) spark-submit can not support symbol link

2015-09-08 Thread xuqing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuqing updated SPARK-10483:
---
Environment: Red Hat Enterprise Linux Server release 6.4 (Santiago)  (was: 
[root@xqwin03 bin]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

[root@xqwin03 bin]# uname -a
Linux xqwin03 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 
x86_64 x86_64 GNU/Linux)

> spark-submit can not support symbol link
> 
>
> Key: SPARK-10483
> URL: https://issues.apache.org/jira/browse/SPARK-10483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1
> Environment: Red Hat Enterprise Linux Server release 6.4 (Santiago)
>Reporter: xuqing
>
> Create a symbolic link for spark-submit:
> {quote}
> [root@xqwin03 bin]# ll spark-submit
> lrwxrwxrwx 1 root root 47 Sep  8 02:49 spark-submit -> 
> /opt/spark-1.3.1-bin-hadoop2.4/bin/spark-submit
> {quote}
> Running spark-submit then fails with the following error:
> {color:red}
> /usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or 
> directory
> {color}
> The reason is that
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
> cannot handle symbolic links.
> Changing it to
> {color:red}
> SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
> {color}
> fixes the problem.






[jira] [Created] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)
Yi Zhou created SPARK-10484:
---

 Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM 
error when cross join happen
 Key: SPARK-10484
 URL: https://issues.apache.org/jira/browse/SPARK-10484
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yi Zhou
Priority: Critical


We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}






[jira] [Updated] (SPARK-10481) SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found

2015-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10481:
--
Priority: Minor  (was: Major)

> SPARK_PREPEND_CLASSES make spark-yarn related jar could not be found
> 
>
> Key: SPARK-10481
> URL: https://issues.apache.org/jira/browse/SPARK-10481
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.4.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> This happens when SPARK_PREPEND_CLASSES is set and Spark runs on YARN.
> With SPARK_PREPEND_CLASSES set, the spark-yarn related jar won't be found, because 
> org.apache.spark.deploy.Client is detected as an individual class rather than as a 
> class in a jar. 
> {code}
> 15/09/08 08:57:10 ERROR SparkContext: Error initializing SparkContext.
> java.util.NoSuchElementException: head of empty list
>   at scala.collection.immutable.Nil$.head(List.scala:337)
>   at scala.collection.immutable.Nil$.head(List.scala:334)
>   at 
> org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$sparkJar(Client.scala:1048)
>   at 
> org.apache.spark.deploy.yarn.Client$.populateClasspath(Client.scala:1159)
>   at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:534)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:645)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:514)
>   at com.zjffdu.tutorial.spark.WordCount$.main(WordCount.scala:24)
>   at com.zjffdu.tutorial.spark.WordCount.main(WordCount.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:680)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
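
For illustration, a hypothetical reconstruction (not the actual Client.scala code) of 
the kind of lookup that breaks here: the Spark jar is derived from the location a 
marker class was loaded from, and with SPARK_PREPEND_CLASSES that location is a 
build/classes directory rather than a jar, so the ".jar" filter leaves an empty list 
and taking its head throws.

{code}
// Hedged sketch: find the jar a marker class was loaded from.
val location = classOf[org.apache.spark.SparkContext]
  .getProtectionDomain.getCodeSource.getLocation.getPath

// With SPARK_PREPEND_CLASSES, location is something like
// .../core/target/scala-2.10/classes, not a jar, so the filter yields Nil and
// .head throws java.util.NoSuchElementException: head of empty list.
val sparkJar = Seq(location).filter(_.endsWith(".jar")).head
{code}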






[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2015-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734495#comment-14734495
 ] 

Sean Owen commented on SPARK-3369:
--

I don't think there's a "why" -- it just hasn't been done by someone who wants to 
do it. I think it's fine to document this. It would be more constructive if you 
opened a PR to this effect.

I hesitate to guarantee that the {{Iterable}} will only be traversed once, 
though I think that happens to be true now. Hence, if this is mentioned in 
docs, it should be suggested as an at-your-own-risk workaround.
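
For concreteness, a minimal sketch of that at-your-own-risk workaround (the class 
name {{IteratorIterable}} comes from the description below; the implementation is 
assumed, not Spark code):

{code}
import scala.collection.JavaConversions.asJavaIterator

// Hands out the SAME underlying iterator on every call, so this is only safe
// if Spark traverses the Iterable exactly once; a second iterator() call
// would return an already-exhausted iterator.
class IteratorIterable[T](it: Iterator[T]) extends java.lang.Iterable[T] {
  override def iterator(): java.util.Iterator[T] = asJavaIterator(it)
}
{code}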

I say above in the description that it was an oversight and inadvertent 
inconsistency. Nobody has ever suggested otherwise so I'm pretty sure that's 
all there is to it. There's no reason it should be inconsistent with the Scala 
impl since it just wraps / follows it in parallel. What else are you looking 
for? IIRC this was confirmed by Matei or Josh a while ago. Do you expect 
otherwise?

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -
>
> Key: SPARK-3369
> URL: https://issues.apache.org/jira/browse/SPARK-3369
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 1.0.2, 1.2.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>  Labels: breaking_change
> Attachments: FlatMapIterator.patch
>
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency 
> because it seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for the other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was an inadvertent inconsistency, then the big issue here 
> is that this is, of course, part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

  was:
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior 

  was:
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior 






[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734395#comment-14734395
 ] 

Cheng Hao commented on SPARK-10484:
---

In the cartesian product implementation there are two levels of nested loops. 
Exchanging the order of the join tables reduces the number of outer-loop 
iterations (the smaller table drives the outer loop) but increases the number of 
inner-loop iterations (over the bigger table); this is really a manual 
optimization of the SQL query.

I created a PR that optimizes the cartesian join by using a broadcast join.
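
For reference, a minimal user-level sketch (not the planner change in the PR) of 
forcing a broadcast join from the DataFrame API in Spark 1.5, assuming the small 
table fits in executor memory:

{code}
import org.apache.spark.sql.functions.{broadcast, expr}

val pr = sqlContext.table("product_reviews")
val stores = sqlContext.table("temp_stores_with_regression")

// broadcast() hints the planner to ship the ~2.2K stores table to every
// executor and use a broadcast nested-loop join for this non-equi predicate,
// instead of CartesianProduct followed by Filter.
val joined = pr.join(
  broadcast(stores),
  expr("locate(lower(s_store_name), lower(pr_review_content), 1) >= 1"))
{code}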

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Created] (SPARK-10483) spark-submit can not support symbol link

2015-09-08 Thread xuqing (JIRA)
xuqing created SPARK-10483:
--

 Summary: spark-submit can not support symbol link
 Key: SPARK-10483
 URL: https://issues.apache.org/jira/browse/SPARK-10483
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.4.1, 1.3.1
 Environment: [root@xqwin03 bin]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 6.4 (Santiago)

[root@xqwin03 bin]# uname -a
Linux xqwin03 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 x86_64 
x86_64 x86_64 GNU/Linux
Reporter: xuqing


Create a symbolic link for ${SPARK_HOME}/spark-submit.
Running spark-submit then fails with the following error:
{color:red}
/usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or directory
{color}

The reason is that
SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
cannot handle symbolic links.

Changing it to
{color:red}
SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
{color}
fixes the problem.
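
For illustration, a minimal sketch of what the proposed fix computes, written here 
in Scala with java.nio (the readlink one-liner above is the actual fix; the paths 
are the ones from this report):

{code}
import java.nio.file.Paths

// /usr/bin/spark-submit is a symlink to the real script under the Spark
// installation. toRealPath() follows the link, so taking bin/../ afterwards
// yields /opt/spark-1.3.1-bin-hadoop2.4 instead of /usr.
val invoked   = Paths.get("/usr/bin/spark-submit")
val real      = invoked.toRealPath()
val sparkHome = real.getParent.getParent
{code}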






[jira] [Updated] (SPARK-10483) spark-submit can not support symbol link

2015-09-08 Thread xuqing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuqing updated SPARK-10483:
---
Description: 
Create a symbolic link for spark-submit:
{quote}
[root@xqwin03 bin]# ll spark-submit
lrwxrwxrwx 1 root root 47 Sep  8 02:49 spark-submit -> 
/opt/spark-1.3.1-bin-hadoop2.4/bin/spark-submit
{quote}

Running spark-submit then fails with the following error:
{color:red}
/usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or directory
{color}

The reason is that
SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
cannot handle symbolic links.

Changing it to
{color:red}
SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
{color}
fixes the problem.

  was:
Create a symbolic link for spark-submit.
Running spark-submit then fails with the following error:
{color:red}
/usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or directory
{color}

The reason is that
SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
cannot handle symbolic links.

Changing it to
{color:red}
SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
{color}
fixes the problem.


> spark-submit can not support symbol link
> 
>
> Key: SPARK-10483
> URL: https://issues.apache.org/jira/browse/SPARK-10483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1
> Environment: [root@xqwin03 bin]# cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
> [root@xqwin03 bin]# uname -a
> Linux xqwin03 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 
> x86_64 x86_64 x86_64 GNU/Linux
>Reporter: xuqing
>
> Create a symbolic link for spark-submit:
> {quote}
> [root@xqwin03 bin]# ll spark-submit
> lrwxrwxrwx 1 root root 47 Sep  8 02:49 spark-submit -> 
> /opt/spark-1.3.1-bin-hadoop2.4/bin/spark-submit
> {quote}
> Running spark-submit then fails with the following error:
> {color:red}
> /usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or 
> directory
> {color}
> The reason is that
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
> cannot handle symbolic links.
> Changing it to
> {color:red}
> SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
> {color}
> fixes the problem.






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Issue Type: Improvement  (was: Bug)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Issue Type: Bug  (was: Improvement)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Created] (SPARK-10485) IF expression is not correctly resolved when one of the options have NullType

2015-09-08 Thread Antonio Jesus Navarro (JIRA)
Antonio Jesus Navarro created SPARK-10485:
-

 Summary: IF expression is not correctly resolved when one of the 
options have NullType
 Key: SPARK-10485
 URL: https://issues.apache.org/jira/browse/SPARK-10485
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Antonio Jesus Navarro


If we have this query:

{code}
SELECT IF(column > 1, 1, NULL) FROM T1
{code}

On Spark 1.4.1 we have this:

{code}
override lazy val resolved = childrenResolved && trueValue.dataType == 
falseValue.dataType
{code}

So if one of the branch types is NullType, the IF expression is never resolved.
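
A minimal sketch (an assumed illustration, not the actual Catalyst coercion rule) of 
the kind of reconciliation that would make the IF resolvable when one branch is NULL:

{code}
import org.apache.spark.sql.types._

// Widen a NullType branch to the other branch's type instead of demanding the
// exact equality that the resolved check above requires.
def reconcileBranchTypes(t: DataType, f: DataType): Option[DataType] =
  (t, f) match {
    case (NullType, other) => Some(other)
    case (other, NullType) => Some(other)
    case (a, b) if a == b  => Some(a)
    case _                 => None // genuinely incompatible branches
  }
{code}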






[jira] [Commented] (SPARK-10479) LogisticRegression copy should copy model summary if available

2015-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734504#comment-14734504
 ] 

Sean Owen commented on SPARK-10479:
---

Seems OK, but this seems so logically related to SPARK-10480 that it could have 
/ should have been tracked as one bug. 

> LogisticRegression copy should copy model summary if available
> --
>
> Key: SPARK-10479
> URL: https://issues.apache.org/jira/browse/SPARK-10479
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> SPARK-9112 adds LogisticRegressionSummary but [does not copy the model 
> summary if 
> available|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L471]
> We should add behavior similar to that in 
> [LinearRegression.copy|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L314]
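
For concreteness, a hedged sketch of the symptom from the user side ({{training}} 
is an assumed DataFrame of labeled data; API names per spark.ml 1.5):

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val model = new LogisticRegression().fit(training) // training: DataFrame, assumed
model.summary                                      // available after fit()
val copied = model.copy(ParamMap.empty)
copied.summary // throws, because copy() does not carry the summary over
{code}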






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two table do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM error when 
two table do cross join  (was: [Spark SQL]  Come across lost task(timeout) or 
GC OOM error when cross join happen)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two table do 
> cross join
> 
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Assigned] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10484:


Assignee: Apache Spark

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Apache Spark
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Assigned] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10484:


Assignee: (was: Apache Spark)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Commented] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734389#comment-14734389
 ] 

Apache Spark commented on SPARK-10484:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8652

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Summary: [Spark SQL]  Come across lost task(timeout) or GC OOM error when 
two tables do cross join  (was: [Spark SQL]  Come across lost task(timeout) or 
GC OOM error when two table do cross join)

> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior: exchanging the two tables in the FROM 
> clause makes the query pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}






[jira] [Updated] (SPARK-10483) spark-submit can not support symbol link

2015-09-08 Thread xuqing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xuqing updated SPARK-10483:
---
Description: 
Create a symbolic link for spark-submit.
Running spark-submit then fails with the following error:
{color:red}
/usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or directory
{color}

The reason is that
SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
cannot handle symbolic links.

Changing it to
{color:red}
SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
{color}
fixes the problem.

  was:
Create a symbolic link for ${SPARK_HOME}/spark-submit.
Running spark-submit then fails with the following error:
{color:red}
/usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or directory
{color}

The reason is that
SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
cannot handle symbolic links.

Changing it to
{color:red}
SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
{color}
fixes the problem.


> spark-submit can not support symbol link
> 
>
> Key: SPARK-10483
> URL: https://issues.apache.org/jira/browse/SPARK-10483
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1
> Environment: [root@xqwin03 bin]# cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 6.4 (Santiago)
> [root@xqwin03 bin]# uname -a
> Linux xqwin03 2.6.32-358.el6.x86_64 #1 SMP Tue Jan 29 11:47:41 EST 2013 
> x86_64 x86_64 x86_64 GNU/Linux
>Reporter: xuqing
>
> Create a symbolic link for spark-submit.
> Running spark-submit then fails with the following error:
> {color:red}
> /usr/bin/spark-submit: line 50: /usr/bin/spark-class: No such file or 
> directory
> {color}
> The reason is that
> SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
> cannot handle symbolic links.
> Changing it to
> {color:red}
> SPARK_HOME="$(cd "`dirname $(readlink -nf "$0")`"/.. ; pwd -P)"
> {color}
> fixes the problem.






[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when cross join happen

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior: exchanging the two tables in the FROM clause 
makes the query pass.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}


  was:
We found lost tasks (timeouts) or GC OOM errors when the cross join below 
happens. The big table on the left is ~1.2 GB and the small table on the right 
is ~2.2 KB.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior 


> [Spark SQL]  Come across lost task(timeout) or GC OOM error when cross join 
> happen
> --
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found lost tasks (timeouts) or GC OOM errors when the cross join below 
> happens. The big table on the left is ~1.2 GB and the small table on the 
> right is ~2.2 KB.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) 
> AS store_ID#446,pr_review_date#449,pr_review_content#455]
> Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
>   CartesianProduct
>HiveTableScan [pr_review_date#449,pr_review_content#455], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
>HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
> Code Generation: true
> {code}
> We also found a strange behavior that exchanging the two table in 'From' 
> clause can pass.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM temp_stores_with_regression stores_with_regression, product_reviews pr
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}
> Physical Plan
> {code:sql}
> TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) 
> AS store_ID#446,pr_review_date#451,pr_review_content#457]
> Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
>   CartesianProduct
>HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
> bigbench, temp_stores_with_regression, Some(stores_with_regression))
>HiveTableScan [pr_review_date#451,pr_review_content#457], 
> (MetastoreRelation bigbench, product_reviews, Some(pr))
> Code Generation: true
> {code}

[jira] [Comment Edited] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-08 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734575#comment-14734575
 ] 

Iulian Dragos edited comment on SPARK-6350 at 9/8/15 10:26 AM:
---

I'm re-opening this, since in the meantime this has regressed. The original fix 
was in commit d86bbb, which regressed in [PR 
4960|https://github.com/apache/spark/pull/4960].

[~tnachen] you might want to take a look at this.


was (Author: dragos):
I'm re-opening this, since in the meantime this regressed. See changes in 
d86bbb, which regressed in [PR 4960|https://github.com/apache/spark/pull/4960].

[~tnachen] you might want to take a look at this.

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.4.0
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches an executor 
> with a given number of CPUs and amount of memory. However, the number of executor 
> cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set that value 
> to 5 for a CPU-intensive task, the Mesos executor always consumes 5 cores even 
> without any running task. This wastes resources. We should make the executor 
> core count a configuration variable.
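
For reference, a hedged sketch of using the knob this issue introduced (the key 
spark.mesos.mesosExecutorCores, per the Mesos configuration docs; the master URL 
is a placeholder):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// In fine-grained mode, reserve only 0.1 cores for the executor itself rather
// than spark.task.cpus, so an idle executor no longer pins 5 cores.
val conf = new SparkConf()
  .setAppName("fine-grained-example")
  .setMaster("mesos://zk://zk1:2181/mesos") // placeholder master URL
  .set("spark.mesos.mesosExecutorCores", "0.1")
  .set("spark.task.cpus", "5")
val sc = new SparkContext(conf)
{code}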






[jira] [Reopened] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-08 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos reopened SPARK-6350:
--

I'm re-opening this, since in the meantime this has regressed. See the changes in 
commit d86bbb, which regressed in [PR 4960|https://github.com/apache/spark/pull/4960].

[~tnachen] you might want to take a look at this.

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.4.0
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches the 
> executor with a given number of CPUs and amount of memory. However, the 
> executor's core count is always CPU_PER_TASKS, the same as spark.task.cpus. 
> If I set that value to 5 for running an intensive task, the Mesos executor 
> always consumes 5 cores even with no running task. This wastes resources. We 
> should make the executor cores a configuration variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10479) LogisticRegression copy should copy model summary if available

2015-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734606#comment-14734606
 ] 

Sean Owen commented on SPARK-10479:
---

This is already being fixed in https://github.com/apache/spark/pull/8641

> LogisticRegression copy should copy model summary if available
> --
>
> Key: SPARK-10479
> URL: https://issues.apache.org/jira/browse/SPARK-10479
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>  Labels: starter
>
> SPARK-9112 adds LogisticRegressionSummary but [does not copy the model 
> summary if 
> available|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L471]
> We should add behavior similar to that in 
> [LinearRegression.copy|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L314]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-09-08 Thread Cheuk Lam (JIRA)
Cheuk Lam created SPARK-10486:
-

 Summary: Spark intermittently fails to recover from a worker 
failure (in standalone mode)
 Key: SPARK-10486
 URL: https://issues.apache.org/jira/browse/SPARK-10486
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Cheuk Lam
Priority: Critical


We have run into a problem where some Spark job is aborted after one worker is 
killed in a 2-worker standalone cluster.  The problem is intermittent, but we 
can consistently reproduce it.  The problem only appears to happen when we kill 
a worker.  It doesn't seem to happen when we kill an executor directly.

The program we use to reproduce the problem is an iterative program based on 
GraphX, although the nature of the issue doesn't seem to be GraphX related.  
This is how we reproduce the problem:
* Set up a standalone cluster of 2 workers;
* Run a Spark application of some iterative program (ours is one based on 
GraphX; a minimal sketch follows this list);
* Kill a worker process (and thus the associated executor);
* Intermittently some job will be aborted.
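
For concreteness, a minimal repro sketch (an assumption-laden example, not the reporter's actual program; the input path is a placeholder, and any iterative GraphX job run while a worker is killed should exercise the same code paths):

{code:scala}
import org.apache.spark.graphx.GraphLoader

// Hedged repro sketch: start an iterative GraphX computation on the 2-worker
// cluster, then kill one worker process while it is running.
val graph = GraphLoader.edgeListFile(sc, "hdfs:///tmp/edges.txt")  // placeholder input
val ranks = graph.pageRank(0.0001).vertices  // runs many iterations (epochs)
println(ranks.count())  // intermittently aborts if stale block info is served
{code}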

Please see the attached driver log for the sequence of error messages that lead 
to the abortion of a job.  The executor logs are also available, as well as the 
application history (event log file) though it is quite large (700MB).

~

After looking into the log files, we think the failure is caused by the 
following two things combined:
* The BlockManagerMasterEndpoint in the driver has some stale block info 
corresponding to the dead executor after the worker has been killed.  The 
driver does appear to handle the "RemoveExecutor" message and cleans up all 
related block info.  But subsequently, and intermittently, it receives some 
Akka messages to re-register the dead BlockManager and re-add some of its 
blocks.  As a result, upon GetLocations requests from the remaining executor, 
the driver responds with some stale block info, instructing the remaining 
executor to fetch blocks from the dead executor.  Please see the driver log 
excerpt below that shows the sequence of events described above.  In the 
log, there are two executors: 1.2.3.4 was the one which got shut down, while 
5.6.7.8 is the remaining executor.  The driver also ran on 5.6.7.8.
* When the remaining executor's BlockManager issues a doGetRemote() call to 
fetch the block of data, it fails because the targeted BlockManager which 
resided in the dead executor is gone.  This failure results in an exception 
forwarded to the caller, bypassing the mechanism in the doGetRemote() function 
to trigger a re-computation of the block.  I don't know whether that is 
intentional or not.

Driver log excerpt showing that the dead BlockManager was re-registered at the 
driver, followed by some stale block info:

11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(172.236378 ms) 
AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout
 -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout,
 stderr -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)

11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO 
BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 GB 
RAM, BlockManagerId(0, 1.2.3.4, 52615)

11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

...

308892 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] ERROR 
TaskSchedulerImpl: Lost executor 0 on 1.2.3.4: worker lost


[jira] [Updated] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-09-08 Thread Cheuk Lam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheuk Lam updated SPARK-10486:
--
Description: 
We have run into a problem where some Spark job is aborted after one worker is 
killed in a 2-worker standalone cluster.  The problem is intermittent, but we 
can consistently reproduce it.  The problem only appears to happen when we kill 
a worker.  It doesn't seem to happen when we kill an executor directly.

The program we use to reproduce the problem is an iterative program based on 
GraphX, although the nature of the issue doesn't seem to be GraphX related.  
This is how we reproduce the problem:
* Set up a standalone cluster of 2 workers;
* Run a Spark application of some iterative program (ours is one based on 
GraphX);
* Kill a worker process (and thus the associated executor);
* Intermittently some job will be aborted.

The driver and executor logs are available, as well as the application history 
(event log file), but they are all quite large and can't be attached here.

~

After looking into the log files, we think the failure is caused by the 
following two things combined:
* The BlockManagerMasterEndpoint in the driver has some stale block info 
corresponding to the dead executor after the worker has been killed.  The 
driver does appear to handle the "RemoveExecutor" message and cleans up all 
related block info.  But subsequently, and intermittently, it receives some 
Akka messages to re-register the dead BlockManager and re-add some of its 
blocks.  As a result, upon GetLocations requests from the remaining executor, 
the driver responds with some stale block info, instructing the remaining 
executor to fetch blocks from the dead executor.  Please see the driver log 
excerpt below that shows the sequence of events described above.  In the 
log, there are two executors: 1.2.3.4 was the one which got shut down, while 
5.6.7.8 is the remaining executor.  The driver also ran on 5.6.7.8.
* When the remaining executor's BlockManager issues a doGetRemote() call to 
fetch the block of data, it fails because the targeted BlockManager which 
resided in the dead executor is gone.  This failure results in an exception 
forwarded to the caller, bypassing the mechanism in the doGetRemote() function 
to trigger a re-computation of the block.  I don't know whether that is 
intentional or not.

Driver log excerpt showing that the dead BlockManager was re-registered at the 
driver, followed by some stale block info:

11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(172.236378 ms) 
AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout
 -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout,
 stderr -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)

11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO 
BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 GB 
RAM, BlockManagerId(0, 1.2.3.4, 52615)

11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

...

308892 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] ERROR 
TaskSchedulerImpl: Lost executor 0 on 1.2.3.4: worker lost

...

308903 15/09/02 20:40:13 [dag-scheduler-event-loop] INFO DAGScheduler: Executor 
lost: 0 (epoch 178)

308904 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RemoveExecutor(0),true) 

[jira] [Commented] (SPARK-5421) SparkSql throw OOM at shuffle

2015-09-08 Thread Romi Kuntsman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734561#comment-14734561
 ] 

Romi Kuntsman commented on SPARK-5421:
--

Does this still happen on the latest version?
I got some OOMs with Spark 1.4.0.

> SparkSql throw OOM at shuffle
> -
>
> Key: SPARK-5421
> URL: https://issues.apache.org/jira/browse/SPARK-5421
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Hong Shen
>
> ExternalAppendOnlyMap is used only for Spark jobs whose aggregator is defined, 
> but Spark SQL's ShuffledRDD doesn't define an aggregator, so Spark SQL won't 
> spill at shuffle; it's very easy to throw OOM at shuffle.  I think Spark SQL 
> also needs to spill at shuffle.
> One of the executor's log, here is  stderr:
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
> for shuffle 1, fetching them
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
> actor = 
> Actor[akka.tcp://sparkDriver@10.196.128.140:40952/user/MapOutputTracker#1435377484]
> 15/01/27 07:02:19 INFO spark.MapOutputTrackerWorker: Got the output locations
> 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Getting 143 
> non-empty blocks out of 143 blocks
> 15/01/27 07:02:19 INFO storage.ShuffleBlockFetcherIterator: Started 4 remote 
> fetches in 72 ms
> 15/01/27 07:47:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
> SIGNAL 15: SIGTERM
> here is  stdout:
> 2015-01-27T07:44:43.487+0800: [Full GC 3961343K->3959868K(3961344K), 
> 29.8959290 secs]
> 2015-01-27T07:45:13.460+0800: [Full GC 3961343K->3959992K(3961344K), 
> 27.9218150 secs]
> 2015-01-27T07:45:41.407+0800: [GC 3960347K(3961344K), 3.0457450 secs]
> 2015-01-27T07:45:52.950+0800: [Full GC 3961343K->3960113K(3961344K), 
> 29.3894670 secs]
> 2015-01-27T07:46:22.393+0800: [Full GC 3961118K->3960240K(3961344K), 
> 28.9879600 secs]
> 2015-01-27T07:46:51.393+0800: [Full GC 3960240K->3960213K(3961344K), 
> 34.1530900 secs]
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill %p"
> #   Executing /bin/sh -c "kill 9050"...
> 2015-01-27T07:47:25.921+0800: [GC 3960214K(3961344K), 3.3959300 secs]
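
To illustrate the claim in the description, a small sketch (our own example under stated assumptions, not code from this ticket): a shuffle that defines an aggregator, such as reduceByKey, can spill through ExternalAppendOnlyMap, while a plain repartitioning shuffle of the same data has no aggregator, which is the case the reporter says matches Spark SQL's shuffles at the time.

{code:scala}
import org.apache.spark.HashPartitioner

// Hedged sketch: same data, two shuffles with different spill behavior.
val pairs = sc.textFile("hdfs:///tmp/input").map(line => (line.take(8), 1L))

// Aggregator defined: the reduce side uses ExternalAppendOnlyMap and can spill.
pairs.reduceByKey(_ + _).count()

// No aggregator (plain ShuffledRDD): fetched rows stay in memory.
pairs.partitionBy(new HashPartitioner(143)).count()
{code}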



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9610) Class and instance weighting for ML

2015-09-08 Thread Nickolay Yakushev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734556#comment-14734556
 ] 

Nickolay Yakushev commented on SPARK-9610:
--

1. Are basic statistics a good candidate for this list?
2. Should we somehow distinguish the nature of the weights, e.g. fuzzy set vs. 
multiset (quantitative)?
3. Can weights be negative?

> Class and instance weighting for ML
> ---
>
> Key: SPARK-9610
> URL: https://issues.apache.org/jira/browse/SPARK-9610
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This umbrella is for tracking tasks for adding support for label or instance 
> weights to ML algorithms.  These additions will help handle skewed or 
> imbalanced data, ensemble methods, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
join below happens. The big left table is ~1.2G in size and the small right 
table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a workaround: exchanging the two tables in the FROM clause lets 
the query pass, but with poor performance.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}
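
An alternative worth trying (an untested sketch, not from this ticket): since the right table is only ~2.2K, explicitly broadcasting it should avoid relying on FROM-clause order. The broadcast and expr helpers are in org.apache.spark.sql.functions as of 1.5.

{code:scala}
import org.apache.spark.sql.functions.{broadcast, expr}

// Hedged sketch: hint Spark to broadcast the small stores table so the large
// product_reviews side is streamed once, independent of FROM-clause order.
val reviews = sqlContext.table("product_reviews")
val stores  = sqlContext.table("temp_stores_with_regression")
val joined = reviews.join(
  broadcast(stores),
  expr("locate(lower(s_store_name), lower(pr_review_content), 1) >= 1"))
{code}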


  was:
We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
join below happens. The big left table is ~1.2G in size and the small right 
table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a strange behavior: exchanging the two tables in the FROM clause 
lets the query pass.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}



> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
> join below happens. The big left table is ~1.2G in size and the small right 
> table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}

[jira] [Updated] (SPARK-10484) [Spark SQL] Come across lost task(timeout) or GC OOM error when two tables do cross join

2015-09-08 Thread Yi Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Zhou updated SPARK-10484:

Description: 
We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
join below happens. The big left table is ~1.2G in size and the small right 
table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}


We also found a workaround: exchanging the two tables in the FROM clause lets 
the query pass, but with poor performance.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}


  was:
We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
join below happens. The big left table is ~1.2G in size and the small right 
table is ~2.2K.

Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM product_reviews pr, temp_stores_with_regression stores_with_regression
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#456L as string),_,s_store_name#457) AS 
store_ID#446,pr_review_date#449,pr_review_content#455]
Filter (locate(lower(s_store_name#457),lower(pr_review_content#455),1) >= 1)
  CartesianProduct
   HiveTableScan [pr_review_date#449,pr_review_content#455], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
   HiveTableScan [s_store_sk#456L,s_store_name#457], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))

Code Generation: true
{code}

We also found a workaround: exchanging the two tables in the FROM clause lets 
the query pass, but with poor performance.
Key SQL
{code:sql}
SELECT
  CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
  pr_review_date,
  pr_review_content
FROM temp_stores_with_regression stores_with_regression, product_reviews pr
WHERE locate(lower(stores_with_regression.s_store_name), 
lower(pr.pr_review_content), 1) >= 1 ;
{code}

Physical Plan
{code:sql}
TungstenProject [concat(cast(s_store_sk#448L as string),_,s_store_name#449) AS 
store_ID#446,pr_review_date#451,pr_review_content#457]
Filter (locate(lower(s_store_name#449),lower(pr_review_content#457),1) >= 1)
  CartesianProduct
   HiveTableScan [s_store_sk#448L,s_store_name#449], (MetastoreRelation 
bigbench, temp_stores_with_regression, Some(stores_with_regression))
   HiveTableScan [pr_review_date#451,pr_review_content#457], (MetastoreRelation 
bigbench, product_reviews, Some(pr))
Code Generation: true
{code}



> [Spark SQL]  Come across lost task(timeout) or GC OOM error when two tables 
> do cross join
> -
>
> Key: SPARK-10484
> URL: https://issues.apache.org/jira/browse/SPARK-10484
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Priority: Critical
>
> We found that tasks get lost (timeout) or GC OOM errors occur when the cross 
> join below happens. The big left table is ~1.2G in size and the small right 
> table is ~2.2K.
> Key SQL
> {code:sql}
> SELECT
>   CONCAT(s_store_sk,"_", s_store_name ) AS store_ID, 
>   pr_review_date,
>   pr_review_content
> FROM product_reviews pr, temp_stores_with_regression stores_with_regression
> WHERE locate(lower(stores_with_regression.s_store_name), 
> lower(pr.pr_review_content), 1) >= 1 ;
> {code}

[jira] [Updated] (SPARK-10479) LogisticRegression copy should copy model summary if available

2015-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10479:
--
Assignee: Yanbo Liang

> LogisticRegression copy should copy model summary if available
> --
>
> Key: SPARK-10479
> URL: https://issues.apache.org/jira/browse/SPARK-10479
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Yanbo Liang
>Priority: Minor
>  Labels: starter
>
> SPARK-9112 adds LogisticRegressionSummary but [does not copy the model 
> summary if 
> available|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L471]
> We should add behavior similar to that in 
> [LinearRegression.copy|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L314]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10288) Add a rest client for Spark on Yarn

2015-09-08 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734637#comment-14734637
 ] 

Steve Loughran commented on SPARK-10288:


Long-haul filesystem communication is addressed by not talking HDFS RPC, but 
instead using webhdfs:// on an HDFS cluster, s3 on an EMR cluster, swift:// for 
OpenStack, and avs on Azure; even if an AWS cluster is backed by HDFS, the 
NodeManagers will still localize off an S3 URL if asked. What all these 
protocols offer is an HTTP/S file API which can go through proxies, including 
anything that restricts cluster access to a few HTTP ports; that is, the core 
RPC protocols can be utterly inaccessible outside the cluster. This fits the 
use cases of cloud deployments and locked-down on-prem Hadoop clusters.
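
As a toy illustration of the idea (placeholder host and path, not from this ticket), a job can address cluster storage through such an HTTP-friendly scheme instead of HDFS RPC:

{code:scala}
// Hedged sketch: read through webhdfs, which tunnels over HTTP/S and can pass
// through proxies even where the native HDFS RPC ports are blocked.
val logs = sc.textFile("webhdfs://namenode.example.com:50070/data/app-logs")
println(logs.count())
{code}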

Security for FS access comes with the filesystems; webhdfs has its REST client 
which handles tokens. Looking at the YARN source, I don't see a REST client 
there; even if one went in now it would be 2.8+ only.

The WiP SPARK-1537 code does have a REST client which supports Authentication 
Token-authed connections to Hadoop REST endpoints under Jersey; the code in the 
rest/ package is not specific to the timeline endpoint. That code could be 
pulled into spark-core. It doesn't do delegation tokens, but the YARN job 
submission use case doesn't need them.

> Add a rest client for Spark on Yarn
> ---
>
> Key: SPARK-10288
> URL: https://issues.apache.org/jira/browse/SPARK-10288
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Saisai Shao
>
> This is a proposal to add a REST client for Spark on YARN. A REST API offers 
> a convenient way for users to submit applications through a REST client; 
> people can easily achieve long-haul submission and build their own submission 
> gateways on top of the REST client.
> Here is the design doc 
> (https://docs.google.com/document/d/1m_P-4olXrp0tJ3kEOLZh1rwrjTfAat7P3fAVPR5GTmg/edit?usp=sharing).
> Currently I'm working on it, working branch is 
> (https://github.com/jerryshao/apache-spark/tree/yarn-rest-support), the major 
> part is already finished.
> Any comment is greatly appreciated, thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation

2015-09-08 Thread Yi Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734721#comment-14734721
 ] 

Yi Zhou commented on SPARK-5791:


[~yhuai], Yes. Thank you!

> [Spark SQL] show poor performance when multiple table do join operation
> ---
>
> Key: SPARK-5791
> URL: https://issues.apache.org/jira/browse/SPARK-5791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yi Zhou
> Attachments: Physcial_Plan_Hive.txt, 
> Physcial_Plan_SparkSQL_Updated.txt, Physical_Plan.txt
>
>
> Spark SQL shows poor performance when multiple tables are joined in one query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10480) ML.LinearRegressionModel.copy() can not use argument "extra"

2015-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10480:
--
Assignee: Yanbo Liang

> ML.LinearRegressionModel.copy() can not use argument "extra"
> 
>
> Key: SPARK-10480
> URL: https://issues.apache.org/jira/browse/SPARK-10480
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> ML.LinearRegressionModel.copy() ignores the argument extra; it does not take 
> effect when users set this parameter.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734706#comment-14734706
 ] 

Apache Spark commented on SPARK-6350:
-

User 'dragos' has created a pull request for this issue:
https://github.com/apache/spark/pull/8653

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.4.0
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches the 
> executor with a given number of CPUs and amount of memory. However, the 
> executor's core count is always CPU_PER_TASKS, the same as spark.task.cpus. 
> If I set that value to 5 for running an intensive task, the Mesos executor 
> always consumes 5 cores even with no running task. This wastes resources. We 
> should make the executor cores a configuration variable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-09-08 Thread Cheuk Lam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheuk Lam updated SPARK-10486:
--
Description: 
We have run into a problem where some Spark job is aborted after one worker is 
killed in a 2-worker standalone cluster.  The problem is intermittent, but we 
can consistently reproduce it.  The problem only appears to happen when we kill 
a worker.  It doesn't seem to happen when we kill an executor directly.

The program we use to reproduce the problem is an iterative program based on 
GraphX, although the nature of the issue doesn't seem to be GraphX related.  
This is how we reproduce the problem:
* Set up a standalone cluster of 2 workers;
* Run a Spark application of some iterative program (ours is one based on 
GraphX);
* Kill a worker process (and thus the associated executor);
* Intermittently some job will be aborted.

The driver and executor logs are available, as well as the application history 
(event log file), but they are quite large and can't be attached here.

~

After looking into the log files, we think the failure is caused by the 
following two things combined:
* The BlockManagerMasterEndpoint in the driver has some stale block info 
corresponding to the dead executor after the worker has been killed.  The 
driver does appear to handle the "RemoveExecutor" message and cleans up all 
related block info.  But subsequently, and intermittently, it receives some 
Akka messages to re-register the dead BlockManager and re-add some of its 
blocks.  As a result, upon GetLocations requests from the remaining executor, 
the driver responds with some stale block info, instructing the remaining 
executor to fetch blocks from the dead executor.  Please see the driver log 
excerpt below that shows the sequence of events described above.  In the 
log, there are two executors: 1.2.3.4 was the one which got shut down, while 
5.6.7.8 is the remaining executor.  The driver also ran on 5.6.7.8.
* When the remaining executor's BlockManager issues a doGetRemote() call to 
fetch the block of data, it fails because the targeted BlockManager which 
resided in the dead executor is gone.  This failure results in an exception 
forwarded to the caller, bypassing the mechanism in the doGetRemote() function 
to trigger a re-computation of the block.  I don't know whether that is 
intentional or not.

Driver log excerpt showing that the dead BlockManager was re-registered at the 
driver, followed by some stale block info:

11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(172.236378 ms) 
AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout
 -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout,
 stderr -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)

11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO 
BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 GB 
RAM, BlockManagerId(0, 1.2.3.4, 52615)

11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

...

308892 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] ERROR 
TaskSchedulerImpl: Lost executor 0 on 1.2.3.4: worker lost

...

308903 15/09/02 20:40:13 [dag-scheduler-event-loop] INFO DAGScheduler: Executor 
lost: 0 (epoch 178)

308904 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RemoveExecutor(0),true) from 

[jira] [Commented] (SPARK-9435) Java UDFs don't work with GROUP BY expressions

2015-09-08 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735416#comment-14735416
 ] 

Michael Armbrust commented on SPARK-9435:
-

From a quick glance, the problem is likely that the {{equals}} function on 
Java UDFs is not working correctly.  As a workaround you could probably 
calculate the udf in a nested select.
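
A sketch of that workaround, using the names from the description (shown with a Scala-registered UDF for brevity; the Java UDF1 registration is analogous): evaluate the UDF once in an inner select and group on its alias.

{code:scala}
// Hedged sketch of the nested-select workaround.
sqlContext.udf.register("inc", (y: Long) => y + 1)
val result = sqlContext.sql(
  """SELECT inc_y, COUNT(DISTINCT x)
    |FROM (SELECT inc(y) AS inc_y, x FROM test_table) t
    |GROUP BY inc_y""".stripMargin)
{code}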

> Java UDFs don't work with GROUP BY expressions
> --
>
> Key: SPARK-9435
> URL: https://issues.apache.org/jira/browse/SPARK-9435
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: All
>Reporter: James Aley
> Attachments: IncMain.java, points.txt
>
>
> If you define a UDF in Java, for example by implementing the UDF1 interface, 
> then try to use that UDF on a column in both the SELECT and GROUP BY clauses 
> of a query, you'll get an error like this:
> {code}
> "SELECT inc(y),COUNT(DISTINCT x) FROM test_table GROUP BY inc(y)"
> org.apache.spark.sql.AnalysisException: expression 'y' is neither present in 
> the group by, nor is it an aggregate function. Add to group by or wrap in 
> first() if you don't care which value you get.
> {code}
> We put together a minimal reproduction in the attached Java file, which makes 
> use of the data in the text file attached.
> I'm guessing there's some kind of issue with the equality implementation, so 
> Spark can't tell that those two expressions are the same maybe? If you do the 
> same thing from Scala, it works fine.
> Note for context: we ran into this issue while working around SPARK-9338.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735454#comment-14735454
 ] 

Apache Spark commented on SPARK-10441:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8655

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10492:


Assignee: Tathagata Das  (was: Apache Spark)

> Update Streaming documentation about rate limiting and backpressure
> ---
>
> Key: SPARK-10492
> URL: https://issues.apache.org/jira/browse/SPARK-10492
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Affects Versions: 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10492:


Assignee: Apache Spark  (was: Tathagata Das)

> Update Streaming documentation about rate limiting and backpressure
> ---
>
> Key: SPARK-10492
> URL: https://issues.apache.org/jira/browse/SPARK-10492
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Affects Versions: 1.5.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735392#comment-14735392
 ] 

Davies Liu commented on SPARK-8632:
---

[~rxin] As [~justin.uang] suggested before, the batch mode would need to flush 
the rows at every stage of the pipeline, or it would deadlock.

I think the goal is to call the upstream once and improve the throughput of the 
Python UDF (which is usually the bottleneck). The batch mode increases the 
overhead of the Python UDF (once per batch), causing worse performance. The 
problem with the old cache is the serialization and memory-management overhead 
(data is also not purged after use). With a one-time (purged once visited) 
Tungsten cache (and spilling), the overhead should not be that high; I think 
this would be the most performant and stable approach.

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735406#comment-14735406
 ] 

Justin Uang commented on SPARK-8632:


Davies, what do you mean by upstream? I didn't quite understand what you meant.

I have implemented a batch-based system that is synchronous instead, so it 
doesn't risk deadlock. (There isn't a writer and a reader thread anymore.)



> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10470) ml.IsotonicRegressionModel.copy did not set parent

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10470.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8637
[https://github.com/apache/spark/pull/8637]

> ml.IsotonicRegressionModel.copy did not set parent
> --
>
> Key: SPARK-10470
> URL: https://issues.apache.org/jira/browse/SPARK-10470
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Fix For: 1.6.0
>
>
> ml.IsotonicRegressionModel.copy did not set parent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735510#comment-14735510
 ] 

Apache Spark commented on SPARK-10492:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8656

> Update Streaming documentation about rate limiting and backpressure
> ---
>
> Key: SPARK-10492
> URL: https://issues.apache.org/jira/browse/SPARK-10492
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Affects Versions: 1.5.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10470) ml.IsotonicRegressionModel.copy did not set parent

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10470:
--
Fix Version/s: 1.5.1

> ml.IsotonicRegressionModel.copy did not set parent
> --
>
> Key: SPARK-10470
> URL: https://issues.apache.org/jira/browse/SPARK-10470
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.0, 1.5.1
>
>
> ml.IsotonicRegressionModel.copy did not set parent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735401#comment-14735401
 ] 

Davies Liu commented on SPARK-10309:


[~nadenf] In my case, the job finally finished (after retry), so this doesn't 
seem to be a blocker for me.

Could you provide more information about your job?

> Some tasks failed with Unable to acquire memory
> ---
>
> Key: SPARK-10309
> URL: https://issues.apache.org/jira/browse/SPARK-10309
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>
> While running Q53 of TPCDS (scale = 1500) on 24 nodes cluster (12G memory on 
> executor):
> {code}
> java.io.IOException: Unable to acquire 33554432 bytes of memory
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
> at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
> at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> The task could finish after a retry.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-10492:
-

 Summary: Update Streaming documentation about rate limiting and 
backpressure
 Key: SPARK-10492
 URL: https://issues.apache.org/jira/browse/SPARK-10492
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Streaming
Affects Versions: 1.5.0
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10470) ml.IsotonicRegressionModel.copy did not set parent

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10470:
--
Target Version/s: 1.6.0, 1.5.1

> ml.IsotonicRegressionModel.copy did not set parent
> --
>
> Key: SPARK-10470
> URL: https://issues.apache.org/jira/browse/SPARK-10470
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> ml.IsotonicRegressionModel.copy did not set parent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10493) reduceByKey not returning distinct results

2015-09-08 Thread Glenn Strycker (JIRA)
Glenn Strycker created SPARK-10493:
--

 Summary: reduceByKey not returning distinct results
 Key: SPARK-10493
 URL: https://issues.apache.org/jira/browse/SPARK-10493
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Glenn Strycker


I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
(using zipPartitions), partitioning by a hash partitioner, and then applying a 
reduceByKey to summarize statistics by key.

Since my set before the reduceByKey consists of records such as (K, V1), (K, 
V2), (K, V3), I expect the results after reduceByKey to be just (K, 
f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
etc.  Therefore, the results after reduceByKey ought to be distinct, correct?  
I am running counts of my RDD and finding that adding an additional .distinct 
after my .reduceByKey is changing the final count!!

Here is some example code:

val rdd3 = tempRDD1.
  zipPartitions(tempRDD2, true)((iter, iter2) => iter ++ iter2).
  partitionBy(new HashPartitioner(numPartitions)).
  reduceByKey((a, b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2,
    math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))

println(rdd3.count)

val rdd4 = rdd3.distinct
println(rdd4.count)
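
As a quick self-check of the expected contract (a minimal sketch assuming a local SparkContext sc, not the reporter's data): after reduceByKey, each key should appear exactly once, so distinct must not change the count.

{code}
// Hedged self-check: if reduceByKey honors its contract, this assertion holds.
val reduced = sc.parallelize(Seq(("k", 1), ("k", 2), ("k", 3), ("j", 4)))
  .reduceByKey(_ + _)
assert(reduced.count() == reduced.distinct().count())
{code}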

I am using persistence, checkpointing, and other stuff in my actual code that I 
did not paste here, so I can paste my actual code if it would be helpful.

This issue may be related to SPARK-2620, except I am not using case classes, to 
my knowledge.

See also 
http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-09-08 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735520#comment-14735520
 ] 

Xiangrui Meng commented on SPARK-10373:
---

No, this is for 1.6.

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10316) respect non-deterministic expressions in PhysicalOperation

2015-09-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10316.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8486
[https://github.com/apache/spark/pull/8486]

> respect non-deterministic expressions in PhysicalOperation
> --
>
> Key: SPARK-10316
> URL: https://issues.apache.org/jira/browse/SPARK-10316
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>
> We did a lot of special handling for non-deterministic expressions in the 
> Optimizer. However, PhysicalOperation just collects all Projects and Filters 
> and mixes up their order. We should respect the operator ordering imposed by 
> non-deterministic expressions in PhysicalOperation.
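
To make the ordering concern concrete, here is an illustration (the example is mine, not from the ticket):

{code}
import org.apache.spark.sql.functions.rand

// Assume df is some existing DataFrame with an "id" column.
val withScore = df.select(df("id"), rand().as("score"))
val filtered  = withScore.filter(withScore("score") > 0.5)

// rand() is non-deterministic: if a rule that collects Projects and Filters
// reorders or re-evaluates it (e.g. pushes the filter below the projection),
// the filter may test different random values than the ones kept in "score".
{code}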



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10470) ml.IsotonicRegressionModel.copy did not set parent

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10470:
--
Assignee: Yanbo Liang

> ml.IsotonicRegressionModel.copy did not set parent
> --
>
> Key: SPARK-10470
> URL: https://issues.apache.org/jira/browse/SPARK-10470
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
>
> ml.IsotonicRegressionModel.copy did not set parent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-09-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-6101.
--
   Resolution: Won't Fix
 Assignee: (was: Chris Fregly)
Fix Version/s: (was: 1.6.0)

I'm going to close this one since it should be a data source living outside of 
Spark, similar to other data sources (e.g. avro, csv, redshift, mongodb, 
cassandra).

Feel free to use this ticket to exchange information and point people to 
3rd-party implementations.


> Create a SparkSQL DataSource API implementation for DynamoDB
> 
>
> Key: SPARK-6101
> URL: https://issues.apache.org/jira/browse/SPARK-6101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chris Fregly
>
> similar to https://github.com/databricks/spark-avro  and 
> https://github.com/databricks/spark-csv
> Here's a good basis for a Java-based, high-level DynamoDB connector:  
> https://github.com/sporcina/dynamodb-connector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9014) Allow Python spark API to use built-in exponential operator

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9014:
---

Assignee: (was: Apache Spark)

> Allow Python spark API to use built-in exponential operator
> ---
>
> Key: SPARK-9014
> URL: https://issues.apache.org/jira/browse/SPARK-9014
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Jon Speiser
>Priority: Minor
>
> It would be nice if instead of saying:
> import pyspark.sql.functions as funcs
> df = df.withColumn("standarderror", funcs.sqrt(df["variance"]))
> ...if I could simply say:
> df = df.withColumn("standarderror", df["variance"] ** 0.5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9014) Allow Python spark API to use built-in exponential operator

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9014:
---

Assignee: Apache Spark

> Allow Python spark API to use built-in exponential operator
> ---
>
> Key: SPARK-9014
> URL: https://issues.apache.org/jira/browse/SPARK-9014
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Jon Speiser
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice if instead of saying:
> import pyspark.sql.functions as funcs
> df = df.withColumn("standarderror", funcs.sqrt(df["variance"]))
> ...if I could simply say:
> df = df.withColumn("standarderror", df["variance"] ** 0.5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9014) Allow Python spark API to use built-in exponential operator

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735664#comment-14735664
 ] 

Apache Spark commented on SPARK-9014:
-

User '0x0FFF' has created a pull request for this issue:
https://github.com/apache/spark/pull/8658

> Allow Python spark API to use built-in exponential operator
> ---
>
> Key: SPARK-9014
> URL: https://issues.apache.org/jira/browse/SPARK-9014
> Project: Spark
>  Issue Type: Wish
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Jon Speiser
>Priority: Minor
>
> It would be nice if instead of saying:
> import pyspark.sql.functions as funcs
> df = df.withColumn("standarderror", funcs.sqrt(df["variance"]))
> ...if I could simply say:
> df = df.withColumn("standarderror", df["variance"] ** 0.5)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10442) select cast('false' as boolean) returns true

2015-09-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735682#comment-14735682
 ] 

Yin Huai commented on SPARK-10442:
--

A related Hive jira is https://issues.apache.org/jira/browse/HIVE-3604.
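
For reference, the reported behavior comes down to a one-liner (a sketch assuming the 1.5 semantics described by this ticket's title):

{code}
// Reported to return true: the string-to-boolean cast apparently does not
// inspect the text 'false' itself.
sqlContext.sql("SELECT CAST('false' AS BOOLEAN)").collect()
// reported result: Array([true])
{code}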

> select cast('false' as boolean) returns true
> 
>
> Key: SPARK-10442
> URL: https://issues.apache.org/jira/browse/SPARK-10442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9769) Add Python API for ml.feature.CountVectorizer

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9769:
-
Summary: Add Python API for ml.feature.CountVectorizer  (was: Add Python 
API for ml.feature.CountVectorizerModel)

> Add Python API for ml.feature.CountVectorizer
> -
>
> Key: SPARK-9769
> URL: https://issues.apache.org/jira/browse/SPARK-9769
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add Python API, user guide and example for ml.feature.CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Don Drake (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735690#comment-14735690
 ] 

Don Drake commented on SPARK-10441:
---

I see that PR 8597 was merged into master.  Does master represent 1.5.1?  I'm 
curious whether this will be part of 1.5.0, as it's blocking me from upgrading 
at the moment.

Thanks.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10482) Add Python interface for CountVectorizer

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-10482.
-
Resolution: Duplicate

> Add Python interface for CountVectorizer
> 
>
> Key: SPARK-10482
> URL: https://issues.apache.org/jira/browse/SPARK-10482
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Add Python interface for ml.CountVectorizer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9769) Add Python API for ml.feature.CountVectorizer

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9769:
-
Assignee: holdenk
Target Version/s: 1.6.0
Priority: Major  (was: Minor)
  Issue Type: New Feature  (was: Improvement)

> Add Python API for ml.feature.CountVectorizer
> -
>
> Key: SPARK-9769
> URL: https://issues.apache.org/jira/browse/SPARK-9769
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: holdenk
>
> Add Python API, user guide and example for ml.feature.CountVectorizerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2015-09-08 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735706#comment-14735706
 ] 

Debasish Das commented on SPARK-10408:
--

[~avulanov] In MLP, can we change BFGS to OWLQN and get L1 regularization? That 
way I can get sparse weights and prune the network to avoid overfitting... For 
the autoencoder, did you experiment with a GraphX-based design? I would like to 
work on it. Basically, the idea is to come up with an N-layer deep autoencoder 
that can support prediction APIs similar to matrix factorization.

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers. 
> References: 
> 1-3. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf
> 4. http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2006_739.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-09-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735735#comment-14735735
 ] 

Yin Huai commented on SPARK-10301:
--

[~lian cheng] Let's also have a follow-up PR for the master branch to address 
post-hoc review comments.

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> We hit this issue when reading a complex Parquet dataset without turning on 
> schema merging.  The dataset consists of Parquet files with different but 
> compatible schemas.  In this way, the schema of the dataset is defined by 
> either a summary file or a random physical Parquet file if no summary files 
> are available.  Apparently, this schema may not contain all fields that 
> appear in all physical files.
> Parquet was designed with schema evolution and column pruning in mind, so it 
> should be legal for a user to use a tailored schema to read the dataset to 
> save disk IO.  For example, say we have a Parquet dataset consisting of two 
> physical Parquet files with the following two schemas:
> {noformat}
> message m0 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
>   }
> }
> message m1 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
> optional int64 f02;
>   }
>   optional double f1;
> }
> {noformat}
> Users should be allowed to read the dataset with the following schema:
> {noformat}
> message m1 {
>   optional group f0 {
> optional int64 f01;
> optional int64 f02;
>   }
> }
> {noformat}
> so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
> expressed by the following {{spark-shell}} snippet:
> {noformat}
> import sqlContext._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{LongType, StructType}
> val path = "/tmp/spark/parquet"
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
> .write.mode("overwrite").parquet(path)
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", 
> "CAST(id AS DOUBLE) AS f1").coalesce(1)
> .write.mode("append").parquet(path)
> val tailoredSchema =
>   new StructType()
> .add(
>   "f0",
>   new StructType()
> .add("f01", LongType, nullable = true)
> .add("f02", LongType, nullable = true),
>   nullable = true)
> read.schema(tailoredSchema).parquet(path).show()
> {noformat}
> Expected output should be:
> {noformat}
> +--------+
> |      f0|
> +--------+
> |[0,null]|
> |[1,null]|
> |[2,null]|
> |   [0,0]|
> |   [1,1]|
> |   [2,2]|
> +--------+
> {noformat}
> However, current 1.5-SNAPSHOT version throws the following exception:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> 

[jira] [Updated] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10466:

Target Version/s: 1.5.1  (was: 1.5.0)

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at 

[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce

2015-09-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735557#comment-14735557
 ] 

Glenn Strycker commented on SPARK-2620:
---

I am finding similar behavior for a non-case-class RDD... see SPARK-10493

> case class cannot be used as key for reduce
> ---
>
> Key: SPARK-2620
> URL: https://issues.apache.org/jira/browse/SPARK-2620
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.0.0, 1.1.0
> Environment: reproduced on spark-shell local[4]
>Reporter: Gerard Maas
>Assignee: Tobias Schlatter
>Priority: Critical
>  Labels: case-class, core
>
> Using a case class as a key doesn't seem to work properly on Spark 1.0.0
> A minimal example:
> case class P(name:String)
> val ps = Array(P("alice"), P("bob"), P("charly"), P("bob"))
> sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect
> [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), 
> (P(bob),1), (P(abe),1), (P(charly),1))
> In contrast to the expected behavior, that should be equivalent to:
> sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect
> Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2))
> groupByKey and distinct also present the same behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735626#comment-14735626
 ] 

Glenn Strycker commented on SPARK-10493:


Thanks for the speedy follow-up, [~frosner]!

I'm attempting now as we speak to replicate this issue in Spark Shell so that I 
can find a nice minimal example and rule out issues specific to my application. 
 I'll update this ticket when I have a better idea of what's happening.

For now, I'm submitting via spark-submit to YARN using cluster mode.  We have a 
specific set up at our company, so my actual submit script includes custom 
configs and such.  Here's a bit of what I submit, so please let me know if you 
need a particular environment variable.  Adding [~aagottlieb] and [~peterwoj] 
as watchers to this ticket so they can comment with appropriate additional 
details.

/opt/spark/bin/spark-submit --conf spark.shuffle.service.enabled=true 
--properties-file spark.conf --files log4j.properties --jars  
--class myProgram --num-executors 428 --driver-memory 25G --executor-memory 20G 
target/scala-2.10/indid3p0_2.10-1.0.jar

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> val rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> val rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735702#comment-14735702
 ] 

Yin Huai commented on SPARK-10441:
--

[~dondrake] https://github.com/apache/spark/pull/8655 is the 1.5 branch 
backport of the fix. I just merged it into branch 1.5. 

Looks like this fix will not be in 1.5.0 (the vote to release 1.5.0 based on 
1.5.0 RC4 just passed, and that RC does not have this fix). But this fix will 
be in the 1.5.1 release.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-09-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10301:
-
Labels: backport-needed  (was: )

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
>  Labels: backport-needed
> Fix For: 1.6.0
>
>
> We hit this issue when reading a complex Parquet dataset without turning on 
> schema merging.  The dataset consists of Parquet files with different but 
> compatible schemas.  In this way, the schema of the dataset is defined by 
> either a summary file or a random physical Parquet file if no summary files 
> are available.  Apparently, this schema may not contain all fields that 
> appear in all physical files.
> Parquet was designed with schema evolution and column pruning in mind, so it 
> should be legal for a user to use a tailored schema to read the dataset to 
> save disk IO.  For example, say we have a Parquet dataset consisting of two 
> physical Parquet files with the following two schemas:
> {noformat}
> message m0 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
>   }
> }
> message m1 {
>   optional group f0 {
> optional int64 f00;
> optional int64 f01;
> optional int64 f02;
>   }
>   optional double f1;
> }
> {noformat}
> Users should be allowed to read the dataset with the following schema:
> {noformat}
> message m1 {
>   optional group f0 {
> optional int64 f01;
> optional int64 f02;
>   }
> }
> {noformat}
> so that {{f0.f00}} and {{f1}} are never touched.  The above case can be 
> expressed by the following {{spark-shell}} snippet:
> {noformat}
> import sqlContext._
> import sqlContext.implicits._
> import org.apache.spark.sql.types.{LongType, StructType}
> val path = "/tmp/spark/parquet"
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id) AS f0").coalesce(1)
> .write.mode("overwrite").parquet(path)
> range(3).selectExpr("NAMED_STRUCT('f00', id, 'f01', id, 'f02', id) AS f0", 
> "CAST(id AS DOUBLE) AS f1").coalesce(1)
> .write.mode("append").parquet(path)
> val tailoredSchema =
>   new StructType()
> .add(
>   "f0",
>   new StructType()
> .add("f01", LongType, nullable = true)
> .add("f02", LongType, nullable = true),
>   nullable = true)
> read.schema(tailoredSchema).parquet(path).show()
> {noformat}
> Expected output should be:
> {noformat}
> +--------+
> |      f0|
> +--------+
> |[0,null]|
> |[1,null]|
> |[2,null]|
> |   [0,0]|
> |   [1,1]|
> |   [2,2]|
> +--------+
> {noformat}
> However, current 1.5-SNAPSHOT version throws the following exception:
> {noformat}
> org.apache.parquet.io.ParquetDecodingException: Can not read value at 0 in 
> block -1 in file 
> hdfs://localhost:9000/tmp/spark/parquet/part-r-0-56c4604e-c546-4f97-a316-05da8ab1a0bf.gz.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
> at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at 
> 

[jira] [Resolved] (SPARK-10428) Struct fields read from parquet are mis-aligned

2015-09-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10428.
--
Resolution: Fixed

> Struct fields read from parquet are mis-aligned
> ---
>
> Key: SPARK-10428
> URL: https://issues.apache.org/jira/browse/SPARK-10428
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
>
> {code}
> val df1 = sqlContext
> .range(1)
> .selectExpr("NAMED_STRUCT('a', id, 'd', id + 3) AS s")
> .coalesce(1)
> val df2 = sqlContext
>   .range(1, 2)
>   .selectExpr("NAMED_STRUCT('a', id, 'b', id + 1, 'c', id + 2, 'd', id + 3) 
> AS s")
>   .coalesce(1)
> df1.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=1")
> df2.write.mode("overwrite").parquet("/home/yin/sc_11_minimal/p=2")
> {code}
> {code}
> sqlContext.read.option("mergeSchema", 
> "true").parquet("/home/yin/sc_11_minimal/").selectExpr("s.a", "s.b", "s.c", 
> "s.d", "p").show
> +---+---+----+----+---+
> |  a|  b|   c|   d|  p|
> +---+---+----+----+---+
> |  0|  3|null|null|  1|
> |  1|  2|   3|   4|  2|
> +---+---+----+----+---+
> {code}
> Looks like the problem is at 
> https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L185-L204:
>  we do padding when the global schema has more struct fields than the local 
> parquet file's schema. However, when we read a field from parquet, we still 
> use parquet's local schema, so we put the value of {{d}} into the wrong slot.
> I tried master. Looks like this issue is resolved by 
> https://github.com/apache/spark/pull/8509. We need to decide if we want to 
> backport that to branch 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9503) Mesos dispatcher NullPointerException (MesosClusterScheduler)

2015-09-08 Thread Sal Uryasev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735750#comment-14735750
 ] 

Sal Uryasev commented on SPARK-9503:


Someone on my team is hitting the same bug.

There is something suspicious going on in the code that may be the cause: 
removeFromQueuedDrivers is called while looping through queuedDrivers, and it 
calls "queuedDrivers.remove(index)".

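If that is the cause, it is the classic remove-while-iterating bug. A minimal Scala sketch (illustrative only, not the Spark code itself):

{code}
import scala.collection.mutable.ArrayBuffer

val queued = ArrayBuffer("d1", "d2", "d3")
for (i <- queued.indices) {   // indices is computed once, up front
  if (queued(i) == "d1")
    queued.remove(i)          // the buffer shrinks, but i keeps advancing
}                             // later iterations misalign or go out of range
{code}

The usual fix is to collect the entries to remove first and delete them after the loop, or to iterate over a snapshot of the buffer.
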
> Mesos dispatcher NullPointerException (MesosClusterScheduler)
> -
>
> Key: SPARK-9503
> URL: https://issues.apache.org/jira/browse/SPARK-9503
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.1
> Environment: branch-1.4 #8dfdca46dd2f527bf653ea96777b23652bc4eb83
>Reporter: Sebastian YEPES FERNANDEZ
>  Labels: mesosphere
>
> Hello,
> I have just started using start-mesos-dispatcher and have been noticing some 
> random NPE crashes.
> By looking at the exception, it looks like in certain situations 
> "queuedDrivers" is empty, which causes the NPE at "submission.cores":
> https://github.com/apache/spark/blob/branch-1.4/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L512-L516
> {code:title=log|borderStyle=solid}
> 15/07/30 23:56:44 INFO MesosRestServer: Started REST server for submitting 
> applications on port 7077
> Exception in thread "Thread-1647" java.lang.NullPointerException
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:437)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler$$anonfun$scheduleTasks$1.apply(MesosClusterScheduler.scala:436)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.scheduleTasks(MesosClusterScheduler.scala:436)
> at 
> org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler.resourceOffers(MesosClusterScheduler.scala:512)
> I0731 00:53:52.969518  7014 sched.cpp:1625] Asked to abort the driver
> I0731 00:53:52.969895  7014 sched.cpp:861] Aborting framework 
> '20150730-234528-4261456064-5050-61754-'
> 15/07/31 00:53:52 INFO MesosClusterScheduler: driver.run() returned with code 
> DRIVER_ABORTED
> {code}
> A side effect of this NPE is that after the crash the dispatcher will not 
> start because its already registered #SPARK-7831
> {code:title=log|borderStyle=solid}
> 15/07/31 09:55:47 INFO MesosClusterUI: Started MesosClusterUI at 
> http://192.168.0.254:8081
> I0731 09:55:47.715039  8162 sched.cpp:157] Version: 0.23.0
> I0731 09:55:47.717013  8163 sched.cpp:254] New master detected at 
> master@192.168.0.254:5050
> I0731 09:55:47.717381  8163 sched.cpp:264] No credentials provided. 
> Attempting to register without authentication
> I0731 09:55:47.718246  8177 sched.cpp:819] Got error 'Completed framework 
> attempted to re-register'
> I0731 09:55:47.718268  8177 sched.cpp:1625] Asked to abort the driver
> 15/07/31 09:55:47 ERROR MesosClusterScheduler: Error received: Completed 
> framework attempted to re-register
> I0731 09:55:47.719091  8177 sched.cpp:861] Aborting framework 
> '20150730-234528-4261456064-5050-61754-0038'
> 15/07/31 09:55:47 INFO MesosClusterScheduler: driver.run() returned with code 
> DRIVER_ABORTED
> 15/07/31 09:55:47 INFO Utils: Shutdown hook called
> {code}
> I can get around this by removing the zk data:
> {code:title=zkCli.sh|borderStyle=solid}
> rmr /spark_mesos_dispatcher
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-09-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-6101:
---
Affects Version/s: (was: 1.2.0)

> Create a SparkSQL DataSource API implementation for DynamoDB
> 
>
> Key: SPARK-6101
> URL: https://issues.apache.org/jira/browse/SPARK-6101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Chris Fregly
>
> similar to https://github.com/databricks/spark-avro  and 
> https://github.com/databricks/spark-csv
> Here's a good basis for a Java-based, high-level DynamoDB connector:  
> https://github.com/sporcina/dynamodb-connector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10373:


Assignee: Apache Spark  (was: Davies Liu)

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10373:


Assignee: Davies Liu  (was: Apache Spark)

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735581#comment-14735581
 ] 

Apache Spark commented on SPARK-10373:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/8657

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-08 Thread Glenn Strycker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735653#comment-14735653
 ] 

Glenn Strycker commented on SPARK-10493:


Note: this only seems to occur "at scale" so far.  I haven't noticed this 
issue in any of my unit tests, which are on the order of a few hundred records, 
but the issue I'm currently seeing is on an RDD with approximately 1B records.  
As you can see in my submit command above, I'm running on 428 executors, and 
inside my Spark program I am requesting 2568 partitions, which is a 6x 
multiple, so each partition should have about 400,000 records if partitioning 
by the hash of the key has low enough skew.

I'm wondering if my particular "reduceByKey.distinct" counts issue has anything 
to do with zipPartitions or partitionBy, which occur before the reduceByKey.  I 
tried splitting these steps into a separate RDD that materializes before the 
reduceByKey runs, but my counts at every step are the same :-/

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> val rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> val rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10468) Verify schema before Dataframe select API call

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10468:
--
Assignee: Vinod KC

> Verify schema before Dataframe select API call
> --
>
> Key: SPARK-10468
> URL: https://issues.apache.org/jira/browse/SPARK-10468
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Vinod KC
>Assignee: Vinod KC
>Priority: Minor
> Fix For: 1.6.0
>
>
> The load methods of GaussianMixtureModel and Word2VecModel should verify the 
> schema before the dataframe.select(...) call.
> Currently, the schema is verified after the dataframe.select(...) call; the 
> order of the method calls needs to change.
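
A sketch of the proposed ordering (the helper and field names below are illustrative, not the actual MLlib code):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Validate the expected fields first, then select, so a malformed file
// fails with a clear schema error instead of an unresolved-column error.
def checkThenSelect(df: DataFrame, expected: Seq[String]): DataFrame = {
  val missing = expected.filterNot(df.schema.fieldNames.contains)
  require(missing.isEmpty, s"Missing fields: ${missing.mkString(", ")}")
  df.select(expected.map(col): _*)
}
{code}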



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9717) Document persistence recommendation for MulticlassMetrics

2015-09-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9717:
-
Description: If a user wants to request multiple metrics from 
MulticlassMetrics, they should persist the RDD beforehand.  The docs should say 
as much.  (was: Add RDD persistence to MulticlassMetrics internals, following 
the example of BinaryClassificationMetrics.)
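
A minimal sketch of the recommended usage (assuming an existing predictionAndLabels: RDD[(Double, Double)]):

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.storage.StorageLevel

predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.precision)   // each metric may trigger a Spark job, so
println(metrics.recall)      // persisting avoids recomputing the input RDD
predictionAndLabels.unpersist()
{code}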

> Document persistence recommendation for MulticlassMetrics
> -
>
> Key: SPARK-9717
> URL: https://issues.apache.org/jira/browse/SPARK-9717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> If a user wants to request multiple metrics from MulticlassMetrics, they 
> should persist the RDD beforehand.  The docs should say as much.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735749#comment-14735749
 ] 

Justin Uang commented on SPARK-8632:


I set the batch size to 100, which is the same as before (things will OOM 
under the same conditions). We can be more clever in the future (add size 
estimation or something similar), but I have a feeling that the overhead of 
batching isn't too great.

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10441.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8597
[https://github.com/apache/spark/pull/8597]

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735644#comment-14735644
 ] 

Davies Liu commented on SPARK-8632:
---

By "upstream" I mean the child of the current SparkPlan, which could contain 
other Python UDFs.

If we removed the RDD cache in 1.4, the upstream would be evaluated twice. If 
you have multiple Python UDFs, for example three, it would end up evaluating 
the child 8 times (2 x 2 x 2), which will be really slow or cause OOM.

In synchronous batch mode, what's the batch size? If it's small, the overhead 
of each batch will be high; if it's too large, it's easy to OOM when you have 
many columns. Also, we need to copy the rows (serialization is not needed if 
it's UnsafeRow).
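
For illustration, synchronous batching boils down to something like this (a sketch under my own assumptions, not the BatchPythonEvaluation code):

{code}
// Rows go to the Python worker in fixed-size groups; the batch size trades
// per-batch overhead against the memory held by one batch at a time.
def batched[T](rows: Iterator[T], batchSize: Int = 100): Iterator[Seq[T]] =
  rows.grouped(batchSize)

// Small batches: more round trips and per-batch overhead.
// Large batches: a whole batch is materialized at once, so wide rows OOM sooner.
{code}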



> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10474) Aggregation failed with unable to acquire memory

2015-09-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10474:

Description: 
In an aggregation case, a lost task failed with the error below.

{code}
 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Key SQL Query
{code:sql}
INSERT INTO TABLE test_table
SELECT
  ss.ss_customer_sk AS cid,
  count(CASE WHEN i.i_class_id=1  THEN 1 ELSE NULL END) AS id1,
  count(CASE WHEN i.i_class_id=3  THEN 1 ELSE NULL END) AS id3,
  count(CASE WHEN i.i_class_id=5  THEN 1 ELSE NULL END) AS id5,
  count(CASE WHEN i.i_class_id=7  THEN 1 ELSE NULL END) AS id7,
  count(CASE WHEN i.i_class_id=9  THEN 1 ELSE NULL END) AS id9,
  count(CASE WHEN i.i_class_id=11 THEN 1 ELSE NULL END) AS id11,
  count(CASE WHEN i.i_class_id=13 THEN 1 ELSE NULL END) AS id13,
  count(CASE WHEN i.i_class_id=15 THEN 1 ELSE NULL END) AS id15,
  count(CASE WHEN i.i_class_id=2  THEN 1 ELSE NULL END) AS id2,
  count(CASE WHEN i.i_class_id=4  THEN 1 ELSE NULL END) AS id4,
  count(CASE WHEN i.i_class_id=6  THEN 1 ELSE NULL END) AS id6,
  count(CASE WHEN i.i_class_id=8  THEN 1 ELSE NULL END) AS id8,
  count(CASE WHEN i.i_class_id=10 THEN 1 ELSE NULL END) AS id10,
  count(CASE WHEN i.i_class_id=14 THEN 1 ELSE NULL END) AS id14,
  count(CASE WHEN i.i_class_id=16 THEN 1 ELSE NULL END) AS id16
FROM store_sales ss
INNER JOIN item i ON ss.ss_item_sk = i.i_item_sk
WHERE i.i_category IN ('Books')
AND ss.ss_customer_sk IS NOT NULL
GROUP BY ss.ss_customer_sk
HAVING count(ss.ss_item_sk) > 5
{code}

Note:
store_sales is a big fact table and item is a small dimension table.


  was:
In aggregation case, a  Lost task happened with below error.

 java.io.IOException: Could not acquire 65536 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.initializeForWriting(UnsafeExternalSorter.java:169)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:220)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.<init>(UnsafeKVExternalSorter.java:126)
at 
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:257)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:435)
at 

[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-08 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735710#comment-14735710
 ] 

Zhan Zhang commented on SPARK-10304:


I did some more investigation. Currently all files are included 
(_common_metadata, etc.), for example:
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_common_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/_metadata
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/part-r-1-39ef2d6e-2832-4757-ac02-0a938eb83b7d.gz.parquet

On the framework level, the partition will be retrieved from both 
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/
and 
/Users/zzhang/repo/spark/sql/core/target/tmp/spark-cd6c0332-c6ed-4ef5-8061-7681b895e07a/id=71/

In this case, the framework cannot differentiate valid and invalid directory.
[~lian cheng] Can we filter all unnecessary files, e.g., _metadata, 
_common_metadata when doing the partition discovery by removing all files start 
with . or underscore? I didn't see such files are useful for partition 
discovery, but I may miss something. Otherwise, it seems to be hard to check 
the validation of the directory.
cc [~yhuai]
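
A minimal sketch of the kind of filter suggested above (a hypothetical helper, not the actual Spark code):

{code}
// Hypothetical helper: drop any path containing a component that starts
// with "." or "_" before inferring partitions.
def isDataPath(name: String): Boolean =
  !name.startsWith(".") && !name.startsWith("_")

val leafFiles = Seq(
  "/tmp/table/_common_metadata",
  "/tmp/table/_metadata",
  "/tmp/table/id=71/part-r-00001.gz.parquet")
leafFiles.filter(_.split("/").forall(isDataPath))
// -> List(/tmp/table/id=71/part-r-00001.gz.parquet)
{code}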


> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we even can return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-08 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735598#comment-14735598
 ] 

Frank Rosner commented on SPARK-10493:
--

Thanks for submitting the issue, [~glenn.strycker] :)

Can you provide a minimal example so we can try to reproduce the issue? It 
should also include the submit command (or are you using the shell?).

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10482) Add Python interface for CountVectorizer

2015-09-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735639#comment-14735639
 ] 

holdenk commented on SPARK-10482:
-

This seems to duplicate https://issues.apache.org/jira/browse/SPARK-9769

> Add Python interface for CountVectorizer
> 
>
> Key: SPARK-10482
> URL: https://issues.apache.org/jira/browse/SPARK-10482
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>
> Add Python interface for ml.CountVectorizer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10442) select cast('false' as boolean) returns true

2015-09-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735668#comment-14735668
 ] 

Yin Huai commented on SPARK-10442:
--

[~lian cheng] Looks like PostgreSQL supports more string literals (see 
http://www.postgresql.org/docs/devel/static/datatype-boolean.html), e.g. 
{{yes}} and {{no}}.
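
For reference, a minimal sketch of the PostgreSQL-style literal mapping (a hypothetical helper, not Spark's current cast rule):

{code}
// Hypothetical helper mirroring the literals PostgreSQL accepts for
// boolean input; anything else is rejected rather than coerced to true.
def parseBoolean(s: String): Option[Boolean] = s.trim.toLowerCase match {
  case "t" | "true"  | "y" | "yes" | "on"  | "1" => Some(true)
  case "f" | "false" | "n" | "no"  | "off" | "0" => Some(false)
  case _                                         => None
}

parseBoolean("false")  // Some(false), unlike cast('false' as boolean) = true
{code}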

> select cast('false' as boolean) returns true
> 
>
> Key: SPARK-10442
> URL: https://issues.apache.org/jira/browse/SPARK-10442
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10468) Verify schema before Dataframe select API call

2015-09-08 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10468.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8636
[https://github.com/apache/spark/pull/8636]

> Verify schema before Dataframe select API call
> --
>
> Key: SPARK-10468
> URL: https://issues.apache.org/jira/browse/SPARK-10468
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Vinod KC
>Priority: Minor
> Fix For: 1.6.0
>
>
> The load methods of GaussianMixtureModel and Word2VecModel should verify the 
> schema before the dataframe.select(...) call.
> Currently the schema is verified after the dataframe.select(...) call; the 
> order of the method calls needs to change.
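
A minimal sketch of the idea (a hypothetical helper, not the actual patch): verify the expected columns before calling select, so a malformed input fails fast with a clear message.

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: check that the expected columns exist before selecting.
def selectChecked(df: DataFrame, expected: String*): DataFrame = {
  val missing = expected.filterNot(df.columns.contains)
  require(missing.isEmpty, s"Missing columns: ${missing.mkString(", ")}")
  df.select(expected.head, expected.tail: _*)
}
{code}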



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10492:
--
Affects Version/s: (was: 1.5.0)

> Update Streaming documentation about rate limiting and backpressure
> ---
>
> Key: SPARK-10492
> URL: https://issues.apache.org/jira/browse/SPARK-10492
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10492) Update Streaming documentation about rate limiting and backpressure

2015-09-08 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10492.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

> Update Streaming documentation about rate limiting and backpressure
> ---
>
> Key: SPARK-10492
> URL: https://issues.apache.org/jira/browse/SPARK-10492
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10441) Cannot write timestamp to JSON

2015-09-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735702#comment-14735702
 ] 

Yin Huai edited comment on SPARK-10441 at 9/8/15 10:00 PM:
---

[~dondrake] https://github.com/apache/spark/pull/8655 is the 1.5 branch 
backport of the fix. I just merged it into branch 1.5. 

It looks like this fix will not be in 1.5.0 (since the vote to release 1.5.0 
based on RC3 just passed, and RC3 does not have this fix). But this fix 
will be in the 1.5.1 release.


was (Author: yhuai):
[~dondrake] https://github.com/apache/spark/pull/8655 is the 1.5 branch 
backport of the fix. I just merged it into branch 1.5. 

Looks like this fix will not be in 1.5.0 (since the vote of releasing 1.5.0 
based on 1.5.0 RC4 just passed and RC3 does not have this fix). But, this fix 
will be in the 1.5.1 release.

> Cannot write timestamp to JSON
> --
>
> Key: SPARK-10441
> URL: https://issues.apache.org/jira/browse/SPARK-10441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9717) Document persistence recommendation for MulticlassMetrics

2015-09-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9717:
-
Summary: Document persistence recommendation for MulticlassMetrics  (was: 
Add persistence to MulticlassMetrics)

> Document persistence recommendation for MulticlassMetrics
> -
>
> Key: SPARK-9717
> URL: https://issues.apache.org/jira/browse/SPARK-9717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add RDD persistence to MulticlassMetrics internals, following the example of 
> BinaryClassificationMetrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9717) Document persistence recommendation for MulticlassMetrics

2015-09-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735726#comment-14735726
 ] 

Joseph K. Bradley commented on SPARK-9717:
--

True. Changing this to a documentation recommendation instead.
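
For context, a minimal sketch of the recommendation being documented (assuming a spark-shell session with a SparkContext named sc):

{code}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.storage.StorageLevel

// MulticlassMetrics makes several passes over its input of
// (prediction, label) pairs, so persist it yourself up front.
val predictionAndLabels = sc.parallelize(Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0)))
predictionAndLabels.persist(StorageLevel.MEMORY_AND_DISK)
val metrics = new MulticlassMetrics(predictionAndLabels)
println(metrics.precision)   // further metrics reuse the cached RDD
predictionAndLabels.unpersist()
{code}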

> Document persistence recommendation for MulticlassMetrics
> -
>
> Key: SPARK-9717
> URL: https://issues.apache.org/jira/browse/SPARK-9717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Add RDD persistence to MulticlassMetrics internals, following the example of 
> BinaryClassificationMetrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10494) Multiple Python UDFs together with aggregation or sort merge join may cause OOM (failed to acquire memory)

2015-09-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-10494:
--

 Summary: Multiple Python UDFs together with aggregation or sort 
merge join may cause OOM (failed to acquire memory)
 Key: SPARK-10494
 URL: https://issues.apache.org/jira/browse/SPARK-10494
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.0
Reporter: Davies Liu
Priority: Critical


The RDD cache for Python UDFs was removed in 1.4, so N Python UDFs in one query 
stage could end up evaluating the upstream SparkPlan 2^N times.

In 1.5, if there is an aggregation or sort merge join in the upstream SparkPlan, 
this can cause OOM (failed to acquire memory).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10467) Vector is converted to tuple when extracted from Row using __getitem__

2015-09-08 Thread Alexey Grishchenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734780#comment-14734780
 ] 

Alexey Grishchenko commented on SPARK-10467:


The issue does not reproduce on master:
{code}
>>> from pyspark.ml.feature import HashingTF
>>> df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
>>> transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
>>> transformed = transformer.transform(df)
>>> row = transformed.first()
>>> row.vec
SparseVector(5, {4: 2.0})
>>> row = Row(vec=Vectors.sparse(3, [(0, 1)]))
>>> df = sqlContext.createDataFrame([row], ("vec", ))
>>> df.first()[0]
SparseVector(3, {0: 1.0})
{code}

> Vector is converted to tuple when extracted from Row using __getitem__
> --
>
> Key: SPARK-10467
> URL: https://issues.apache.org/jira/browse/SPARK-10467
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark, SQL
>Affects Versions: 1.4.1
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> If we take a row from a data frame and try to extract vector element by index 
> it is converted to tuple:
> {code}
> from pyspark.ml.feature import HashingTF
> df = sqlContext.createDataFrame([(["foo", "bar"], )], ("keys", ))
> transformer = HashingTF(inputCol="keys", outputCol="vec", numFeatures=5)
> transformed = transformer.transform(df)
> row = transformed.first()
> row.vec # As expected
> ## SparseVector(5, {4: 2.0})
> row[1]  # Returns tuple
> ## (0, 5, [4], [2.0]) 
> {code}
> Problem cannot be reproduced if we create and access Row directly:
> {code}
> from pyspark.mllib.linalg import Vectors
> from pyspark.sql.types import Row
> row = Row(vec=Vectors.sparse(3, [(0, 1)]))
> row.vec
> ## SparseVector(3, {0: 1.0})
> row[0]
> ## SparseVector(3, {0: 1.0})
> {code}
> but if we use above to create a data frame and extract:
> {code}
> df = sqlContext.createDataFrame([row], ("vec", ))
> df.first()[0]
> ## (0, 3, [0], [1.0])  
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10486) Spark intermittently fails to recover from a worker failure (in standalone mode)

2015-09-08 Thread Cheuk Lam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheuk Lam updated SPARK-10486:
--
Description: 
We have run into a problem where some Spark job is aborted after one worker is 
killed in a 2-worker standalone cluster.  The problem is intermittent, but we 
can consistently reproduce it.  The problem only appears to happen when we kill 
a worker.  It doesn't seem to happen when we kill an executor directly.

The program we use to reproduce the problem is some iterative program based on 
GraphX, although the nature of the issue doesn't seem to be GraphX related.  
This is how we reproduce the problem:
* Set up a standalone cluster of 2 workers;
* Run a Spark application of some iterative program (ours is some based on 
GraphX);
* Kill a worker process (and thus the associated executor);
* Intermittently some job will be aborted.

The driver and the executor logs are available, as well as the application 
history (event log file).  But they are quite large and can't be attached here.

~

After looking into the log files, we think the failure is caused by the 
following two things combined:
* The BlockManagerMasterEndpoint in the driver has some stale block info 
corresponding to the dead executor after the worker has been killed.  The 
driver does appear to handle the "RemoveExecutor" message and cleans up all 
related block info.  But subsequently, and intermittently, it receives some 
Akka messages to re-register the dead BlockManager and re-add some of its 
blocks.  As a result, upon GetLocations requests from the remaining executor, 
the driver responds with some stale block info, instructing the remaining 
executor to fetch blocks from the dead executor.  Please see the driver log 
excerpt below, which shows the sequence of events described above.  In the 
log, there are two executors: 1.2.3.4 was the one that was shut down, while 
5.6.7.8 is the remaining executor.  The driver also ran on 5.6.7.8.
* When the remaining executor's BlockManager issues a doGetRemote() call to 
fetch the block of data, it fails because the targeted BlockManager which 
resided in the dead executor is gone.  This failure results in an exception 
forwarded to the caller, bypassing the mechanism in the doGetRemote() function 
to trigger a re-computation of the block.  I don't know whether that is 
intentional or not.

Driver log excerpt showing that the driver received messages to 
re-register the dead executor after handling the RemoveExecutor message:

11690 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(172.236378 ms) 
AkkaMessage(RegisterExecutor(0,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/Executor#670388190]),1.2.3.4:36140,8,Map(stdout
 -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stdout,
 stderr -> 
http://1.2.3.4:8081/logPage/?appId=app-20150902203512-=0=stderr)),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$f]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

11717 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: Received RPC message: 
AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)

11718 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] INFO 
BlockManagerMasterEndpoint: Registering block manager 1.2.3.4:52615 with 6.2 GB 
RAM, BlockManagerId(0, 1.2.3.4, 52615)

11719 15/09/02 20:35:16 [sparkDriver-akka.actor.default-dispatcher-15] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] handled message 
(1.498313 ms) AkkaMessage(RegisterBlockManager(BlockManagerId(0, 1.2.3.4, 
52615),6667936727,AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/user/BlockManagerEndpoint1#-21635])),true)
 from Actor[akka.tcp://sparkExecutor@1.2.3.4:36140/temp/$g]

...

308892 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] ERROR 
TaskSchedulerImpl: Lost executor 0 on 1.2.3.4: worker lost

...

308903 15/09/02 20:40:13 [dag-scheduler-event-loop] INFO DAGScheduler: Executor 
lost: 0 (epoch 178)

308904 15/09/02 20:40:13 [sparkDriver-akka.actor.default-dispatcher-3] DEBUG 
AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1: [actor] received message 
AkkaMessage(RemoveExecutor(0),true) from 

[jira] [Comment Edited] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-09-08 Thread Rustam Aliyev (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734830#comment-14734830
 ] 

Rustam Aliyev edited comment on SPARK-6101 at 9/8/15 1:58 PM:
--

What's the status of this? The GH repo has not been updated for a while.

A few improvements I'd like to see:
# Use {{FilterExpression}} instead of legacy {{ScanFilter}} for {{scan}} 
operation
# Leverage Parallel Scan 
(http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan)
 and make multiple Spark workers pull data in parallel.
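
A rough sketch of item 2 (assuming the AWS SDK v1 DynamoDB client and a SparkContext named sc; the table name and segment count are illustrative, and pagination via LastEvaluatedKey is omitted):

{code}
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
import com.amazonaws.services.dynamodbv2.model.ScanRequest
import scala.collection.JavaConverters._

// Each Spark partition scans one DynamoDB segment in parallel.
val totalSegments = 8
val rows = sc.parallelize(0 until totalSegments, totalSegments).flatMap { segment =>
  val client = new AmazonDynamoDBClient()  // one client per task
  val request = new ScanRequest("my_table")
    .withTotalSegments(totalSegments)
    .withSegment(segment)
  client.scan(request).getItems.asScala
}
{code}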


was (Author: rstml):
What's the status of this? GH repo has not been updated for a while.

Few improvements which I'd like to see:
# Use {FilterExpression} instead of legacy {ScanFilter} in the {scan}
# Leverage Parallel Scan 
(http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#QueryAndScanParallelScan)
 and make multiple Spark workers pull data in parallel.

> Create a SparkSQL DataSource API implementation for DynamoDB
> 
>
> Key: SPARK-6101
> URL: https://issues.apache.org/jira/browse/SPARK-6101
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Chris Fregly
>Assignee: Chris Fregly
> Fix For: 1.6.0
>
>
> similar to https://github.com/databricks/spark-avro  and 
> https://github.com/databricks/spark-csv
> Here's a good basis for a java-based, high-level dynamodb java connector:  
> https://github.com/sporcina/dynamodb-connector/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-09-08 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14734792#comment-14734792
 ] 

Martin Tapp commented on SPARK-4940:


I see your point, and thinking about it, round-robin is excellent as long as 
`spark.executor.cores` gets implemented for Mesos. The root of my problem is 
the need to reserve cores for threads that are outside Spark's control. I don't 
think there's an issue opened for that yet (spark.executor.cores support for 
Mesos), though. What do you think?

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10493) reduceByKey not returning distinct results

2015-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735772#comment-14735772
 ] 

Sean Owen commented on SPARK-10493:
---

There are some key pieces of info missing, like what the key and value types 
are. This could happen if they do not implement equals/hashCode correctly. It 
could also happen if your input is nondeterministic. Without a reproducible 
example with these details I don't think this is an actionable JIRA.
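
For illustration, a minimal sketch (a hypothetical key class, assuming a spark-shell session with a SparkContext named sc) of how a key type without proper equals/hashCode produces exactly this symptom:

{code}
// Identity-based equality: two BadKey(1) instances never compare equal,
// so reduceByKey cannot merge them.
class BadKey(val id: Int) extends Serializable

val pairs = sc.parallelize(Seq((new BadKey(1), 1), (new BadKey(1), 2)))
println(pairs.reduceByKey(_ + _).count())  // 2, not 1

// A case class gets structural equals/hashCode for free.
case class GoodKey(id: Int)
val fixed = sc.parallelize(Seq((GoodKey(1), 1), (GoodKey(1), 2)))
println(fixed.reduceByKey(_ + _).count())  // 1
{code}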

> reduceByKey not returning distinct results
> --
>
> Key: SPARK-10493
> URL: https://issues.apache.org/jira/browse/SPARK-10493
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Glenn Strycker
>
> I am running Spark 1.3.0 and creating an RDD by unioning several earlier RDDs 
> (using zipPartitions), partitioning by a hash partitioner, and then applying 
> a reduceByKey to summarize statistics by key.
> Since my set before the reduceByKey consists of records such as (K, V1), (K, 
> V2), (K, V3), I expect the results after reduceByKey to be just (K, 
> f(V1,V2,V3)), where the function f is appropriately associative, commutative, 
> etc.  Therefore, the results after reduceByKey ought to be distinct, correct? 
>  I am running counts of my RDD and finding that adding an additional 
> .distinct after my .reduceByKey is changing the final count!!
> Here is some example code:
> rdd3 = tempRDD1.
>zipPartitions(tempRDD2, true)((iter, iter2) => iter++iter2).
>partitionBy(new HashPartitioner(numPartitions)).
>reduceByKey((a,b) => (math.Ordering.String.min(a._1, b._1), a._2 + b._2, 
> math.max(a._3, b._3), math.max(a._4, b._4), math.max(a._5, b._5)))
> println(rdd3.count)
> rdd4 = rdd3.distinct
> println(rdd4.count)
> I am using persistence, checkpointing, and other stuff in my actual code that 
> I did not paste here, so I can paste my actual code if it would be helpful.
> This issue may be related to SPARK-2620, except I am not using case classes, 
> to my knowledge.
> See also 
> http://stackoverflow.com/questions/32466176/apache-spark-rdd-reducebykey-operation-not-returning-correct-distinct-results



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10466) UnsafeRow exception in Sort-Based Shuffle with data spill

2015-09-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735776#comment-14735776
 ] 

Davies Liu commented on SPARK-10466:


[~chenghao] I tried your test case and it passed on master. Is there anything 
else I need to reproduce the failure?

> UnsafeRow exception in Sort-Based Shuffle with data spill 
> --
>
> Key: SPARK-10466
> URL: https://issues.apache.org/jira/browse/SPARK-10466
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> In sort-based shuffle, if we have data spill, it will cause assert exception, 
> the follow code can reproduce that
> {code}
> withSparkConf(("spark.shuffle.sort.bypassMergeThreshold", "2")) {
>   withSQLConf(("spark.sql.autoBroadcastJoinThreshold", "0")) {
> withTempTable("mytemp") {
>   sparkContext.parallelize(1 to 1000, 3).map(i => (i, 
> i)).toDF("key", "value").registerTempTable("mytemp")
>   sql("select key, value as v1 from mytemp where key > 
> 1").registerTempTable("l")
>   sql("select key, value as v2 from mytemp where key > 
> 3").registerTempTable("r")
>   val df3 = sql("select v1, v2 from l left join r on l.key=r.key")
>   df3.count()
> }
>   }
> }
> {code}
> {code}
> java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>   at java.lang.Thread.run(Thread.java:722)
> 17:32:06.172 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 1.0 (TID 4, localhost): java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
>   at 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> 

[jira] [Commented] (SPARK-10433) Gradient boosted trees

2015-09-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735800#comment-14735800
 ] 

Joseph K. Bradley commented on SPARK-10433:
---

I had seen the input size growing, but I missed the growth in number of records 
when I first glanced at this JIRA.  That's really strange.  I'll try to 
reproduce it in case it's a Spark Core bug.  It may have been inadvertently 
fixed by my caching/checkpointing patch in 1.5.

> Gradient boosted trees
> --
>
> Key: SPARK-10433
> URL: https://issues.apache.org/jira/browse/SPARK-10433
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Sean Owen
>
> (Sorry to say I don't have any leads on a fix, but this was reported by three 
> different people and I confirmed it at fairly close range, so I think it's 
> legitimate.)
> This is probably best explained in the words from the mailing list thread at 
> http://mail-archives.apache.org/mod_mbox/spark-user/201509.mbox/%3C55E84380.2000408%40gmail.com%3E
>  . Matt Forbes says:
> {quote}
> I am training a boosted trees model on a couple million input samples (with 
> around 300 features) and am noticing that the input size of each stage is 
> increasing each iteration. For each new tree, the first step seems to be 
> building the decision tree metadata, which does a .count() on the input data, 
> so this is the step I've been using to track the input size changing. Here is 
> what I'm seeing: 
> {quote}
> {code}
> count at DecisionTreeMetadata.scala:111 
> 1. Input Size / Records: 726.1 MB / 1295620 
> 2. Input Size / Records: 106.9 GB / 64780816 
> 3. Input Size / Records: 160.3 GB / 97171224 
> 4. Input Size / Records: 214.8 GB / 129680959 
> 5. Input Size / Records: 268.5 GB / 162533424 
>  
> Input Size / Records: 1912.6 GB / 1382017686 
>  
> {code}
> {quote}
> This step goes from taking less than 10s up to 5 minutes by the 15th or so 
> iteration. I'm not quite sure what could be causing this. I am passing a 
> memory-only cached RDD[LabeledPoint] to GradientBoostedTrees.train 
> {quote}
> Johannes Bauer showed me a very similar problem.
> Peter Rudenko offers this sketch of a reproduction:
> {code}
> val boostingStrategy = BoostingStrategy.defaultParams("Classification")
> boostingStrategy.setNumIterations(30)
> boostingStrategy.setLearningRate(1.0)
> boostingStrategy.treeStrategy.setMaxDepth(3)
> boostingStrategy.treeStrategy.setMaxBins(128)
> boostingStrategy.treeStrategy.setSubsamplingRate(1.0)
> boostingStrategy.treeStrategy.setMinInstancesPerNode(1)
> boostingStrategy.treeStrategy.setUseNodeIdCache(true)
> boostingStrategy.treeStrategy.setCategoricalFeaturesInfo(
>   
> mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer,
>  java.lang.Integer]])
> val model = GradientBoostedTrees.train(instances, boostingStrategy)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10327) Cache Table is not working while subquery has alias in its project list

2015-09-08 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10327.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8494
[https://github.com/apache/spark/pull/8494]

> Cache Table is not working while subquery has alias in its project list
> ---
>
> Key: SPARK-10327
> URL: https://issues.apache.org/jira/browse/SPARK-10327
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
> Fix For: 1.6.0
>
>
> Code to reproduce that:
> {code}
> import org.apache.spark.sql.hive.execution.HiveTableScan
> sql("select key, value, key + 1 from src").registerTempTable("abc")
> cacheTable("abc")
> val sparkPlan = sql(
>   """select a.key, b.key, c.key from
> |abc a join abc b on a.key=b.key
> |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan
> assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size 
> === 3) // failed
> assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // 
> failed
> {code}
> The query plan like:
> {code}
> == Parsed Logical Plan ==
> 'Project 
> [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
>  'Join Inner, Some(('a.key = 'c.key))
>   'Join Inner, Some(('a.key = 'b.key))
>'UnresolvedRelation [abc], Some(a)
>'UnresolvedRelation [abc], Some(b)
>   'UnresolvedRelation [abc], Some(c)
> == Analyzed Logical Plan ==
> key: int, key: int, key: int
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Join Inner, Some((key#14 = key#61))
>Subquery a
> Subquery abc
>  Project [key#14,value#15,(key#14 + 1) AS _c2#16]
>   MetastoreRelation default, src, None
>Subquery b
> Subquery abc
>  Project [key#61,value#62,(key#61 + 1) AS _c2#58]
>   MetastoreRelation default, src, None
>   Subquery c
>Subquery abc
> Project [key#66,value#67,(key#66 + 1) AS _c2#63]
>  MetastoreRelation default, src, None
> == Optimized Logical Plan ==
> Project [key#14,key#61,key#66]
>  Join Inner, Some((key#14 = key#66))
>   Project [key#14,key#61]
>Join Inner, Some((key#14 = key#61))
> Project [key#14]
>  InMemoryRelation [key#14,value#15,_c2#16], true, 1, 
> StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 
> 1) AS _c2#16]), Some(abc)
> Project [key#61]
>  MetastoreRelation default, src, None
>   Project [key#66]
>MetastoreRelation default, src, None
> == Physical Plan ==
> TungstenProject [key#14,key#61,key#66]
>  BroadcastHashJoin [key#14], [key#66], BuildRight
>   TungstenProject [key#14,key#61]
>BroadcastHashJoin [key#14], [key#61], BuildRight
> ConvertToUnsafe
>  InMemoryColumnarTableScan [key#14], (InMemoryRelation 
> [key#14,value#15,_c2#16], true, 1, StorageLevel(true, true, false, true, 
> 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
> ConvertToUnsafe
>  HiveTableScan [key#61], (MetastoreRelation default, src, None)
>   ConvertToUnsafe
>HiveTableScan [key#66], (MetastoreRelation default, src, None)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


