[jira] [Commented] (SPARK-9865) Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame

2015-10-22 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969536#comment-14969536
 ] 

Weiqiang Zhuang commented on SPARK-9865:


Just FYI, I also got this test failure a couple of times.

> Flaky SparkR test: test_sparkSQL.R: sample on a DataFrame
> -
>
> Key: SPARK-9865
> URL: https://issues.apache.org/jira/browse/SPARK-9865
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Davies Liu
>
> 1. Failure (at test_sparkSQL.R#525): sample on a DataFrame 
> -
> count(sampled3) < 3 isn't true
> Error: Test failures
> Execution halted
> https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1468/console



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted to query

2015-10-22 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10562:
-
Assignee: Wenchen Fan

> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jason Pohl
>Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the 
> partition-by columns in all lowercase in the metastore. However, it writes 
> the data to the filesystem using mixed case.
> This causes an error when running a select against the table.
> --
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a 
> mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> %sql
> --Now try to run a query against this table
> select * from chelsea_goals
> Error in SQL statement: UncheckedExecutionException: 
> java.lang.RuntimeException: Partition column year not found in schema 
> StructType(StructField(Goals,LongType,true), 
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>  Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> %sql select * from chelsea_goals2;
> --Now everything works
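Until the metastore handling is fixed, one possible workaround is to normalize the 
partition column name to lowercase before writing, so the metastore entry and the 
on-disk path agree. A minimal Scala sketch of that idea (the helper name and the 
example table name are assumptions, not part of the report):

{code}
// Hedged workaround sketch, not the fix for the bug itself: rename the
// mixed-case partition column to lowercase before writing.
import org.apache.spark.sql.DataFrame

def saveWithLowercasePartition(df: DataFrame, partCol: String, table: String): Unit = {
  val lower = partCol.toLowerCase            // "Year" -> "year"
  df.withColumnRenamed(partCol, lower)
    .write
    .partitionBy(lower)                      // metastore column and data path now match
    .saveAsTable(table)
}

// e.g. saveWithLowercasePartition(myDF, "Year", "chelsea_goals3")
{code}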



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11232) NettyRpcEndpointRef.send should not be interrupted

2015-10-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11232.

Resolution: Fixed
  Assignee: Shixiong Zhu

> NettyRpcEndpointRef.send should not be interrupted
> --
>
> Key: SPARK-11232
> URL: https://issues.apache.org/jira/browse/SPARK-11232
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> The current NettyRpcEndpointRef.send can be interrupted because it uses 
> `LinkedBlockingQueue.put`, which may hang the application. E.g., 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull
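For context on why `put` is the problem: on a LinkedBlockingQueue, `put` acquires its 
lock interruptibly and so throws InterruptedException when the calling thread is 
interrupted, while `offer` on an unbounded queue never blocks and ignores the interrupt 
flag. A self-contained Scala sketch of just that difference (this is not the actual 
NettyRpcEndpointRef code):

{code}
// Illustrative only: put() honours thread interruption, offer() does not block
// and cannot be aborted by the interrupt flag on an unbounded queue.
import java.util.concurrent.LinkedBlockingQueue

object PutVsOffer extends App {
  val queue = new LinkedBlockingQueue[String]()

  Thread.currentThread().interrupt()        // simulate an interrupted sender thread
  try queue.put("message")                  // throws InterruptedException
  catch { case _: InterruptedException => println("put was interrupted") }

  Thread.currentThread().interrupt()        // the throw cleared the flag; set it again
  val accepted = queue.offer("message")     // never blocks on an unbounded queue
  println(s"offer accepted: $accepted")     // prints: offer accepted: true
  Thread.interrupted()                      // clear the flag before exiting
}
{code}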



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Jeffrey Naisbitt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Naisbitt updated SPARK-11264:
-
Component/s: Spark Shell

> ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
> 
>
> Key: SPARK-11264
> URL: https://issues.apache.org/jira/browse/SPARK-11264
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
>Reporter: Jeffrey Naisbitt
>Priority: Minor
>
> Some GREP_OPTIONS settings modify the output of the grep commands that 
> look for the assembly jars in bin/spark-class.
> For example, if the -n option is specified, the grep output will look like: 
> {code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
> This will not match the regular expressions, and so the jar files will not be 
> found.  We could improve the regular expression to handle cases like this and 
> trim off extra characters, but it is difficult to know which options may or 
> may not be set.  Unsetting GREP_OPTIONS within the script handles all the 
> cases and gives the desired output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11260) Add 'with' API

2015-10-22 Thread Weiqiang Zhuang (JIRA)
Weiqiang Zhuang created SPARK-11260:
---

 Summary: Add 'with' API
 Key: SPARK-11260
 URL: https://issues.apache.org/jira/browse/SPARK-11260
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Weiqiang Zhuang
Priority: Minor


R has the with() API to evaluate an expression in a data environment. This JIRA 
is to implement a similar with() API for the DataFrame. It will simplify the 
calls to functions within the expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11245) Upgrade twitter4j to version 4.x

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11245:


Assignee: (was: Apache Spark)

> Upgrade twitter4j to version 4.x
> 
>
> Key: SPARK-11245
> URL: https://issues.apache.org/jira/browse/SPARK-11245
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
>
> Twitter4J is already on a 4.x release
> https://github.com/yusuke/twitter4j/releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11245) Upgrade twitter4j to version 4.x

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11245:


Assignee: Apache Spark

> Upgrade twitter4j to version 4.x
> 
>
> Key: SPARK-11245
> URL: https://issues.apache.org/jira/browse/SPARK-11245
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
>Assignee: Apache Spark
>
> Twitter4J is already on a 4.x release
> https://github.com/yusuke/twitter4j/releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11260) Add 'with' API

2015-10-22 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969528#comment-14969528
 ] 

Weiqiang Zhuang commented on SPARK-11260:
-

I will be working on this.

> Add 'with' API
> --
>
> Key: SPARK-11260
> URL: https://issues.apache.org/jira/browse/SPARK-11260
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Weiqiang Zhuang
>Priority: Minor
>
> R has the with() API to evaluate an expression in a data environment. This 
> JIRA is to implement a similar with() API for the DataFrame. It will 
> simplify the calls to functions within the expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11261) Provide a more flexible alternative to Jdbc RDD

2015-10-22 Thread Richard Marscher (JIRA)
Richard Marscher created SPARK-11261:


 Summary: Provide a more flexible alternative to Jdbc RDD
 Key: SPARK-11261
 URL: https://issues.apache.org/jira/browse/SPARK-11261
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Richard Marscher


The existing JdbcRDD covers only a limited set of use cases because it requires 
your query to operate on upper- and lower-bound predicates like: 
"select title, author from books where ? <= id and id <= ?"

However, many use cases cannot be expressed this way, or are much less 
efficient when they are.

For example, we have a MySQL table partitioned on a partition key. We don't 
have range values to look up; rather, we want to get all entries matching a 
predicate and have Spark run one query per Spark partition against each logical 
partition of our MySQL table. For example: "select * from devices where 
partition_id = ? and app_id = 'abcd'".

Another use case is looking up a distinct set of identifiers that don't 
fall within an ordering: "select * from users where user_id in 
(?,?,?,?,?,?,?)". The number of identifiers may be quite large and/or dynamic.

Solution:
Instead of addressing each use case differently with new RDD types, provide an 
alternate, general RDD that gives the user direct control over how the query is 
partitioned in Spark and how the placeholders are filled in.

The user should be able to control which placeholder values are available on 
each partition of the RDD and also how they are inserted into the 
PreparedStatement. Ideally it would support dynamic placeholder values, such as 
inserting a set of values for an IN clause.
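One way to read the proposal in Scala terms: give each Spark partition its own group of 
placeholder values and let a user-supplied function bind them to the PreparedStatement. 
The sketch below is only an illustration of that shape (the helper name, the JDBC URL, 
and the example table and columns are assumptions, not a proposed API):

{code}
// Rough sketch of "one query per logical partition, user-controlled placeholders".
import java.sql.{DriverManager, PreparedStatement, ResultSet}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def queryPerPartition[T: ClassTag](
    sc: SparkContext,
    url: String,
    sql: String,
    placeholderGroups: Seq[Seq[Any]],                 // one group per Spark partition
    bind: (PreparedStatement, Seq[Any]) => Unit,      // user controls placeholder filling
    extract: ResultSet => T): RDD[T] = {
  sc.parallelize(placeholderGroups, placeholderGroups.size).flatMap { values =>
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.prepareStatement(sql)
      bind(stmt, values)
      val rs = stmt.executeQuery()
      Iterator.continually(rs).takeWhile(_.next()).map(extract).toList
    } finally {
      conn.close()
    }
  }
}

// e.g. one query per logical MySQL partition (hypothetical table and column names):
// queryPerPartition(sc, jdbcUrl,
//   "select * from devices where partition_id = ? and app_id = 'abcd'",
//   (0 until 16).map(i => Seq(i)),
//   (st, vals) => st.setInt(1, vals.head.asInstanceOf[Int]),
//   rs => rs.getString("device_id"))
{code}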



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11163) Remove unnecessary addPendingTask calls in TaskSetManager.executorLost

2015-10-22 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-11163.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9154
[https://github.com/apache/spark/pull/9154]

> Remove unnecessary addPendingTask calls in TaskSetManager.executorLost
> --
>
> Key: SPARK-11163
> URL: https://issues.apache.org/jira/browse/SPARK-11163
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Kay Ousterhout
>Assignee: Kay Ousterhout
>Priority: Minor
> Fix For: 1.6.0
>
>
> The proposed commit removes unnecessary calls to addPendingTask in
> TaskSetManager.executorLost. These calls are unnecessary: for
> tasks that are still pending and haven't been launched, they're
> still in all of the correct pending lists, so calling addPendingTask
> has no effect. For tasks that are currently running (which may still be
> in the pending lists, depending on how they were scheduled), we call
> addPendingTask in handleFailedTask, so the calls at the beginning
> of executorLost are redundant.
> I think these calls are left over from when we re-computed the locality
> levels in addPendingTask; now that we call recomputeLocality separately,
> I don't think these are necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-22 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968711#comment-14968711
 ] 

kevin yu commented on SPARK-5966:
-

Hello Tathagata & Andrew:
I have coded a possible fix, and the error message will look like this:
$ ./bin/spark-submit --master local[10] --deploy-mode cluster 
examples/src/main/python/pi.py
Error: Cluster deploy mode is not compatible with master "local"
Run with --help for usage help or --verbose for debug output

Let me know if you have any comments. Otherwise, I am going to submit a PR 
shortly. Thanks.

Kevin


> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968739#comment-14968739
 ] 

Apache Spark commented on SPARK-5966:
-

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/9220

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-22 Thread Frank Rosner (JIRA)
Frank Rosner created SPARK-11258:


 Summary: Remove quadratic runtime complexity for converting a 
Spark DataFrame into an R data.frame
 Key: SPARK-11258
 URL: https://issues.apache.org/jira/browse/SPARK-11258
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Frank Rosner


h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run-time 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns).

h4. Problem

The {{dfToCols}} method transposes the row-wise representation of the Spark 
DataFrame (array of rows) into a column-wise representation (array of columns) 
to then be put into a data frame. This is done in a very inefficient way, 
leading to huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row-wise representation to the column-wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
method takes 2267 ms on average to complete. My implementation takes only 554 ms 
on average. The effect gets even bigger the more columns you have.
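The proposed change amounts to filling pre-allocated per-column arrays in a single pass 
over the collected rows, instead of traversing the full row set once per column; a 
simplified Scala sketch of that idea (not the actual patch in the pull request):

{code}
// Simplified sketch: one pass over the rows places each value directly into
// its column array, instead of scanning all rows once per column.
import org.apache.spark.sql.{DataFrame, Row}

def dfToColsOnePass(df: DataFrame): Array[Array[Any]] = {
  val rows: Array[Row] = df.collect()
  val numCols = df.schema.fields.length
  val cols = Array.fill(numCols)(new Array[Any](rows.length))
  var i = 0
  while (i < rows.length) {                 // single pass over the data
    val row = rows(i)
    var j = 0
    while (j < numCols) {
      cols(j)(i) = row.get(j)
      j += 1
    }
    i += 1
  }
  cols
}
{code}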



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9523) Receiver for Spark Streaming does not naturally support kryo serializer

2015-10-22 Thread Yuhang Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968628#comment-14968628
 ] 

Yuhang Chen commented on SPARK-9523:


So you mean closures also support Kryo? But I never added any Kryo code to them, 
and they worked just fine when the Kryo serializer was set in SparkConf, while 
the receivers didn't. That is what confused me.

> Receiver for Spark Streaming does not naturally support kryo serializer
> ---
>
> Key: SPARK-9523
> URL: https://issues.apache.org/jira/browse/SPARK-9523
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.1
> Environment: Windows 7 local mode
>Reporter: Yuhang Chen
>Priority: Minor
>  Labels: kryo, serialization
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> In some cases an attribute of a class is not serializable but you still want 
> to use it after the whole object is serialized, so you have to customize your 
> serialization code. For example, you can declare those attributes as 
> transient, which makes them ignored during serialization, and then you can 
> reassign their values during deserialization.
> Now, if you're using Java serialization, you have to implement Serializable 
> and write that code in the readObject() and writeObject() methods; and if 
> you're using Kryo serialization, you have to implement KryoSerializable and 
> write that code in the read() and write() methods.
> In Spark and Spark Streaming, you can set Kryo as the serializer to speed 
> things up. However, the functions taken by RDD or DStream operations are 
> still serialized by Java serialization, which means you only need to write 
> the custom serialization code in readObject() and writeObject().
> But when it comes to Spark Streaming's Receiver, things are different. When 
> you wish to customize an InputDStream, you must extend the Receiver. However, 
> it turns out the Receiver will be serialized by Kryo if you set the Kryo 
> serializer in SparkConf, and will fall back to Java serialization if you 
> didn't.
> So here is the problem: if you want to switch the serializer by configuration 
> and make sure the Receiver works correctly for both Java and Kryo, you have 
> to write all four methods above. First, that is redundant, since you have to 
> write the serialization/deserialization code almost twice; second, there is 
> nothing in the docs or in the code to tell users to implement the 
> KryoSerializable interface. 
> Since all other function parameters are serialized by Java only, I suggest 
> making it so for the Receiver as well. It may be slower, but since the 
> serialization is only executed once per interval, that is tolerable. More 
> importantly, it causes less trouble.
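To make the described duplication concrete, here is a hedged Scala sketch of a Receiver 
that carries both the Java hooks (readObject/writeObject) and the Kryo hooks (read/write); 
MyClient is a stand-in for a non-serializable resource and the class is illustrative, not 
taken from the report:

{code}
// Illustrative sketch of the duplication described above; MyClient is a
// stand-in for a non-serializable resource.
import java.io.{ObjectInputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyClient(host: String, port: Int) { def close(): Unit = () }

class MyReceiver(private var host: String, private var port: Int)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) with KryoSerializable {

  @transient private var client: MyClient = _   // not serializable, rebuilt in onStart()

  override def onStart(): Unit = { client = new MyClient(host, port) }
  override def onStop(): Unit = { if (client != null) client.close() }

  // Java serialization hooks
  private def writeObject(out: ObjectOutputStream): Unit = out.defaultWriteObject()
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    client = null                               // restored lazily by onStart()
  }

  // Kryo serialization hooks -- essentially the same logic written a second time
  override def write(kryo: Kryo, output: Output): Unit = {
    output.writeString(host)
    output.writeInt(port)
  }
  override def read(kryo: Kryo, input: Input): Unit = {
    host = input.readString()
    port = input.readInt()
    client = null
  }
}
{code}

Which pair of hooks actually runs depends on the configured spark.serializer, which is 
exactly the surprise described in the report.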



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11255) R Test build should run on R 3.1.1

2015-10-22 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11255:


 Summary: R Test build should run on R 3.1.1
 Key: SPARK-11255
 URL: https://issues.apache.org/jira/browse/SPARK-11255
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Felix Cheung
Priority: Minor


Tests should run on R 3.1.1, which is the version listed as supported.
Apparently a few R changes can go undetected since the Jenkins test 
build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11245) Upgrade twitter4j to version 4.x

2015-10-22 Thread pronix (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968744#comment-14968744
 ] 

pronix commented on SPARK-11245:


https://github.com/apache/spark/pull/9221

> Upgrade twitter4j to version 4.x
> 
>
> Key: SPARK-11245
> URL: https://issues.apache.org/jira/browse/SPARK-11245
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
> Fix For: 1.6.0
>
>
> Twitter4J is already on a 4.x release
> https://github.com/yusuke/twitter4j/releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

2015-10-22 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968608#comment-14968608
 ] 

Sun Rui commented on SPARK-11231:
-

[~shivaram] This feature can be done on the R side even without support on the 
Scala side. But if the Scala side can support this, the R implementation can be 
simplified.

> join returns schema with duplicated and ambiguous join columns
> --
>
> Key: SPARK-11231
> URL: https://issues.apache.org/jira/browse/SPARK-11231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: R
>Reporter: Matt Pollock
>
> In the case where the key columns of two data frames are named the same thing, 
> join returns a data frame where that column is duplicated. Since the content 
> of the columns is guaranteed to be the same by row, consolidating the 
> identical columns into a single column would replicate standard R behavior [1] 
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could 
> be: key#125, key#127.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>   at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no "join", but a similar function "merge" is provided 
> in which a "by" argument identifies the shared key column in the two data 
> frames. In the case where the key column names differ, "by.x" and "by.y" 
> arguments can be used. In the case of same-named key columns the 
> consolidation behavior requested above is observed. In the case of differing 
> names, the "by.x" name is retained and consolidated with the "by.y" column, 
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A  1  4
> 2   B  2  5
> 3   C  3  6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A  4  1
> 2   B  5  2
> 3   C  6  3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11231) join returns schema with duplicated and ambiguous join columns

2015-10-22 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968605#comment-14968605
 ] 

Sun Rui commented on SPARK-11231:
-

[~mpollock], we are implementing R-like merge in SparkR 
(https://github.com/apache/spark/pull/9012). Could you take a look at it and 
give some feedback? Also, there is a JIRA submitted to request such a feature in 
Spark core (https://issues.apache.org/jira/browse/SPARK-11250).

> join returns schema with duplicated and ambiguous join columns
> --
>
> Key: SPARK-11231
> URL: https://issues.apache.org/jira/browse/SPARK-11231
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: R
>Reporter: Matt Pollock
>
> In the case where the key columns of two data frames are named the same thing, 
> join returns a data frame where that column is duplicated. Since the content 
> of the columns is guaranteed to be the same by row, consolidating the 
> identical columns into a single column would replicate standard R behavior [1] 
> and help prevent ambiguous names.
> Example:
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > sdf1 <- createDataFrame(sqlContext, df1)
> > sdf2 <- createDataFrame(sqlContext, df2)
> > sjdf <- join(sdf1, sdf2, sdf1$key == sdf2$key, "inner")
> > schema(sjdf)
> StructType
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value1", type = "DoubleType", nullable = TRUE
> |-name = "key", type = "StringType", nullable = TRUE
> |-name = "value2", type = "DoubleType", nullable = TRUE
> {code}
> The duplicated key columns cause things like:
> {code}
> > library(magrittr)
> > sjdf %>% select("key")
> 15/10/21 11:04:28 ERROR r.RBackendHandler: select on 1414 failed
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
>   org.apache.spark.sql.AnalysisException: Reference 'key' is ambiguous, could 
> be: key#125, key#127.;
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:278)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:162)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$20.apply(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:403)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:399)
>   at org.apache.spark.sql.catalyst.tree
> {code}
> [1] In base R there is no "join", but a similar function "merge" is provided 
> in which a "by" argument identifies the shared key column in the two data 
> frames. In the case where the key column names differ, "by.x" and "by.y" 
> arguments can be used. In the case of same-named key columns the 
> consolidation behavior requested above is observed. In the case of differing 
> names, the "by.x" name is retained and consolidated with the "by.y" column, 
> which is dropped.
> {code}
> > df1 <- data.frame(key=c("A", "B", "C"), value1=c(1, 2, 3))
> > df2 <- data.frame(key=c("A", "B", "C"), value2=c(4, 5, 6))
> > merge(df1, df2, by="key")
>   key value1 value2
> 1   A  1  4
> 2   B  2  5
> 3   C  3  6
> df3 <- data.frame(akey=c("A", "B", "C"), value1=c(1, 2, 3))
> > merge(df2, df3, by.x="key", by.y="akey")
>   key value2 value1
> 1   A  4  1
> 2   B  5  2
> 3   C  6  3
> > merge(df3, df2, by.x="akey", by.y="key")
>   akey value1 value2
> 1    A      1      4
> 2    B      2      5
> 3    C      3      6
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.

2015-10-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968648#comment-14968648
 ] 

Saisai Shao commented on SPARK-11181:
-

OK, got it.

> Spark Yarn : Spark reducing total executors count even when Dynamic 
> Allocation is disabled.
> ---
>
> Key: SPARK-11181
> URL: https://issues.apache.org/jira/browse/SPARK-11181
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, YARN
>Affects Versions: 1.3.1
> Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. 
> All servers in cluster running Linux version 2.6.32. 
> Job in yarn-client mode.
>Reporter: prakhar jauhari
>
> Spark driver reduces total executors count even when Dynamic Allocation is 
> not enabled.
> To reproduce this:
> 1. A 2 node yarn setup : each DN has ~ 20GB mem and 4 cores.
> 2. When the application launches and gets its required executors, one of the 
> DNs loses connectivity and is timed out.
> 3. Spark issues a killExecutor for the executor on the DN which was timed 
> out. 
> 4. Even with dynamic allocation off, spark's scheduler reduces the 
> "targetNumExecutors".
> 5. Thus the job runs with reduced executor count.
> Note: the severity of the issue increases if some of the DNs that were 
> running my job's executors lose connectivity intermittently: the Spark scheduler 
> reduces "targetNumExecutors", thus not asking for new executors on any other 
> nodes, causing the job to hang.
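What the report describes is effectively a missing guard: the executor target should 
only shrink when dynamic allocation is enabled. The Scala sketch below is purely 
illustrative (the class and method names are hypothetical, not the real 
scheduler-backend code):

{code}
// Hypothetical illustration of the guard; not the real scheduler-backend code.
import org.apache.spark.SparkConf

class ExecutorTargetTracker(conf: SparkConf) {
  private val dynamicAllocation =
    conf.getBoolean("spark.dynamicAllocation.enabled", false)
  private var targetNumExecutors =
    conf.getInt("spark.executor.instances", 2)

  def onExecutorLost(): Unit = {
    if (dynamicAllocation) {
      targetNumExecutors -= 1   // shrinking the target only makes sense here
    }
    // with dynamic allocation off the target stays fixed, so a replacement
    // executor would be requested from YARN instead of silently running short
  }

  def currentTarget: Int = targetNumExecutors
}
{code}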



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11214) Join with Unicode-String results wrong empty

2015-10-22 Thread Hans Fischer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Fischer reopened SPARK-11214:
--

I have updated to Spark 1.5.1 and still get an empty result set. How could this 
be?

> Join with Unicode-String results wrong empty
> 
>
> Key: SPARK-11214
> URL: https://issues.apache.org/jira/browse/SPARK-11214
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Hans Fischer
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.1
>
>
> I created a join that should clearly result in a single row but returns an 
> empty result. Could someone validate this bug?
> hiveContext.sql('SELECT * FROM (SELECT "c" AS a) AS a JOIN (SELECT "c" AS b) 
> AS b ON a.a = b.b').take(10)
> result: []
> kind regards
> Hans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Lokesh Kumar (JIRA)
Lokesh Kumar created SPARK-11257:


 Summary: Spark dataframe negate filter conditions
 Key: SPARK-11257
 URL: https://issues.apache.org/jira/browse/SPARK-11257
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
 Environment: Fedora 21 core i5
Reporter: Lokesh Kumar
 Fix For: 1.5.0


I am trying to apply a negated filter condition on the DataFrame as shown 
below:

!(`Ship Mode` LIKE '%Truck%')
This throws the exception below:

Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
identifier expected

(!(`Ship Mode` LIKE '%Truck%'))
  ^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
at Main.main(Main.java:73)
Whereas the same kind of negated filter condition works fine in MySQL. 
Please find the output below:

mysql> select count(*) from audit_log where !(operation like '%Log%' or 
operation like '%Proj%');
+----------+
| count(*) |
+----------+
|      129 |
+----------+
1 row in set (0.05 sec)
Can anyone please let me know if this is planned to be fixed in Spark 
DataFrames in future releases?
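As a possible workaround (not a fix for the parser itself), the same predicate can be 
negated through the Column API, which bypasses the SQL-expression parser; a short Scala 
sketch, where df stands for the DataFrame being filtered above:

{code}
// Workaround sketch: express the negation with Column operators instead of a
// SQL expression string, since the 1.5 expression parser rejects "!(...)".
import org.apache.spark.sql.functions.{col, not}

val notTrucks = df.filter(!col("Ship Mode").like("%Truck%"))
// equivalently:
val notTrucks2 = df.filter(not(col("Ship Mode").like("%Truck%")))
notTrucks.show()
{code}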



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968814#comment-14968814
 ] 

Apache Spark commented on SPARK-11258:
--

User 'FRosner' has created a pull request for this issue:
https://github.com/apache/spark/pull/9222

> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run-time 
> complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method transposes the row-wise representation of the 
> Spark DataFrame (array of rows) into a column-wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, leading to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. The effect gets even bigger the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11258:


Assignee: (was: Apache Spark)

> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run-time 
> complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method transposes the row-wise representation of the 
> Spark DataFrame (array of rows) into a column-wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, leading to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. The effect gets even bigger the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11258:


Assignee: Apache Spark

> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>Assignee: Apache Spark
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run-time 
> complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method transposes the row-wise representation of the 
> Spark DataFrame (array of rows) into a column-wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, leading to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row-wise representation to the column-wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
> method takes 2267 ms on average to complete. My implementation takes only 554 ms 
> on average. The effect gets even bigger the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.

2015-10-22 Thread prakhar jauhari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968655#comment-14968655
 ] 

prakhar jauhari commented on SPARK-11181:
-

Ya, I'll do it for branch 1.3 :)

> Spark Yarn : Spark reducing total executors count even when Dynamic 
> Allocation is disabled.
> ---
>
> Key: SPARK-11181
> URL: https://issues.apache.org/jira/browse/SPARK-11181
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, YARN
>Affects Versions: 1.3.1
> Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. 
> All servers in cluster running Linux version 2.6.32. 
> Job in yarn-client mode.
>Reporter: prakhar jauhari
>
> Spark driver reduces total executors count even when Dynamic Allocation is 
> not enabled.
> To reproduce this:
> 1. A 2 node yarn setup : each DN has ~ 20GB mem and 4 cores.
> 2. When the application launches and gets its required executors, one of the 
> DNs loses connectivity and is timed out.
> 3. Spark issues a killExecutor for the executor on the DN which was timed 
> out. 
> 4. Even with dynamic allocation off, spark's scheduler reduces the 
> "targetNumExecutors".
> 5. Thus the job runs with reduced executor count.
> Note: the severity of the issue increases if some of the DNs that were 
> running my job's executors lose connectivity intermittently: the Spark scheduler 
> reduces "targetNumExecutors", thus not asking for new executors on any other 
> nodes, causing the job to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9319) Add support for setting column names, types

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9319:
---

Assignee: Apache Spark

> Add support for setting column names, types
> ---
>
> Key: SPARK-9319
> URL: https://issues.apache.org/jira/browse/SPARK-9319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> This will help us support functions of the form 
> {code}
> colnames(data) <- c("Date", "Arrival_Delay")
> coltypes(data) <- c("numeric", "logical", "character")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9319) Add support for setting column names, types

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9319:
---

Assignee: (was: Apache Spark)

> Add support for setting column names, types
> ---
>
> Key: SPARK-9319
> URL: https://issues.apache.org/jira/browse/SPARK-9319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This will help us support functions of the form 
> {code}
> colnames(data) <- c("Date", "Arrival_Delay")
> coltypes(data) <- c("numeric", "logical", "character")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9319) Add support for setting column names, types

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968657#comment-14968657
 ] 

Apache Spark commented on SPARK-9319:
-

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/9218

> Add support for setting column names, types
> ---
>
> Key: SPARK-9319
> URL: https://issues.apache.org/jira/browse/SPARK-9319
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This will help us support functions of the form 
> {code}
> colnames(data) <- c("Date", "Arrival_Delay")
> coltypes(data) <- c("numeric", "logical", "character")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9595) Adding API to SparkConf for kryo serializers registration

2015-10-22 Thread Yuhang Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968615#comment-14968615
 ] 

Yuhang Chen commented on SPARK-9595:


Sorry, I replied to the wrong person; the question was meant for another issue. 
Please just ignore it.

> Adding API to SparkConf for kryo serializers registration
> -
>
> Key: SPARK-9595
> URL: https://issues.apache.org/jira/browse/SPARK-9595
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Yuhang Chen
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Currently SparkConf has a registerKryoClasses API for Kryo registration. 
> However, this only works when you register classes. If you want to register 
> customized Kryo serializers, you'll have to extend the KryoSerializer class 
> and write some code.
> This is not only very inconvenient, but also requires the registration to be 
> done at compile time, which is not always possible. Thus, I suggest adding 
> another API to SparkConf for registering customized Kryo serializers. It could 
> be like this:
> def registerKryoSerializers(serializers: Map[Class[_], Serializer]): SparkConf
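For contrast with the proposal, one route available today is a KryoRegistrator subclass 
named in the configuration, which is what ties the registration to compile time; a 
hedged Scala sketch (MyType, MyTypeSerializer and MyRegistrator are placeholders):

{code}
// Sketch of the existing compile-time route the proposal wants to avoid.
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class MyType(val value: String)

class MyTypeSerializer extends Serializer[MyType] {
  override def write(kryo: Kryo, output: Output, obj: MyType): Unit =
    output.writeString(obj.value)
  override def read(kryo: Kryo, input: Input, cls: Class[MyType]): MyType =
    new MyType(input.readString())
}

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyType], new MyTypeSerializer)
}

// driver-side configuration (spark-shell / script style)
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)
{code}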



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11254) Thriftserver on kerberos secured cluster in YARN mode

2015-10-22 Thread wpxidian (JIRA)
wpxidian created SPARK-11254:


 Summary: Thriftserver on kerberos secured cluster in YARN mode 
 Key: SPARK-11254
 URL: https://issues.apache.org/jira/browse/SPARK-11254
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.5.0
 Environment: CDH-5.2.4,
Hadoop-2.5.0,
Spark-1.5.1, Spark-1.5.0,
Kerberos
Reporter: wpxidian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-22 Thread Romi Kuntsman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968662#comment-14968662
 ] 

Romi Kuntsman commented on SPARK-11229:
---

[~marmbrus] it's reproducible in 1.5.1, as [~xwu0226] confirmed; shouldn't it be 
marked as "fixed in 1.6.0" instead of "cannot reproduce"?

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at 
> 

[jira] [Closed] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-22 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-11190.
---
Resolution: Won't Fix

Closing it, as the feature is already in the master branch and will be available 
in the next release.

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR
>
> I want to create a data frame from a Cassandra keyspace and column family in 
> sparkR. 
> I am able to create data frames from tables which do not include any 
> Cassandra collection datatypes, 
> such as Map, Set and List. But many of the schemas that I need data from 
> do include these collection data types. 
> Here is my local environment. 
> SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
> To test this issue, I did the following iterative process. 
> sudo ./sparkR --packages 
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf 
> spark.cassandra.connection.host=127.0.0.1
> Running this command with sparkR gives me access to the Spark Cassandra 
> connector package I need, 
> and connects me to my local cqlsh server (which is up and running while 
> running this code in the sparkR shell). 
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
>   column_10   ascii,
>   column_11   decimal,
>   column_12   double,
>   column_13   inet,
>   column_14   varchar,
>   column_15   varint,
>   PRIMARY KEY( ( column_1, column_2 ) )
> ); 
> All of the above data types are supported. I insert dummy data after creating 
> this test schema. 
> For example, now in my sparkR shell, I run the following code. 
> df.test  <- read.df(sqlContext,  source = "org.apache.spark.sql.cassandra", 
> keyspace = "datahub", table = "test_table")
> The assignment completes with no errors. Then:
> > schema(df.test)
> StructType
> |-name = "column_1", type = "IntegerType", nullable = TRUE
> |-name = "column_2", type = "StringType", nullable = TRUE
> |-name = "column_10", type = "StringType", nullable = TRUE
> |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
> |-name = "column_12", type = "DoubleType", nullable = TRUE
> |-name = "column_13", type = "InetAddressType", nullable = TRUE
> |-name = "column_14", type = "StringType", nullable = TRUE
> |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
> |-name = "column_3", type = "FloatType", nullable = TRUE
> |-name = "column_4", type = "UUIDType", nullable = TRUE
> |-name = "column_5", type = "TimestampType", nullable = TRUE
> |-name = "column_6", type = "BooleanType", nullable = TRUE
> |-name = "column_7", type = "UUIDType", nullable = TRUE
> |-name = "column_8", type = "LongType", nullable = TRUE
> |-name = "column_9", type = "BinaryType", nullable = TRUE
> Schema is correct. 
> > class(df.test)
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> df.test is clearly defined to be a DataFrame Object. 
> > head(df.test)
>   column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
> 1        1    hello        NA        NA        NA        NA        NA        NA
>   column_3 column_4 column_5 column_6 column_7 column_8 column_9
> 1      3.4       NA       NA       NA       NA       NA       NA
> sparkR is reading from the column family correctly, but now let's add a 
> collection data type to the schema. 
> Now I will drop that test_table and recreate the table with an extra 
> column of data type map:
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
> 

[jira] [Commented] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.

2015-10-22 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968649#comment-14968649
 ] 

Saisai Shao commented on SPARK-11181:
-

I think you'd better backport it to branch 1.3, not to the 1.3.1 tag :).

> Spark Yarn : Spark reducing total executors count even when Dynamic 
> Allocation is disabled.
> ---
>
> Key: SPARK-11181
> URL: https://issues.apache.org/jira/browse/SPARK-11181
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, YARN
>Affects Versions: 1.3.1
> Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. 
> All servers in cluster running Linux version 2.6.32. 
> Job in yarn-client mode.
>Reporter: prakhar jauhari
>
> The Spark driver reduces the total executor count even when dynamic allocation 
> is not enabled.
> To reproduce this:
> 1. A 2-node YARN setup: each DN has ~20 GB of memory and 4 cores.
> 2. When the application launches and gets its required executors, one of the 
> DNs loses connectivity and is timed out.
> 3. Spark issues a killExecutor for the executor on the DN which was timed 
> out. 
> 4. Even with dynamic allocation off, Spark's scheduler reduces 
> "targetNumExecutors".
> 5. Thus the job runs with a reduced executor count.
> Note: the severity of the issue increases if some of the DNs that were 
> running my job's executors lose connectivity intermittently; the Spark 
> scheduler keeps reducing "targetNumExecutors", thus not asking for new 
> executors on any other nodes and causing the job to hang.
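
For reference, that dynamic allocation is indeed disabled can be confirmed from 
the driver with something like the following (a sketch only, assuming {{sc}} is 
the running SparkContext):

{code}
// Should print false in the setup described above; the executor count is
// reduced by the scheduler even though dynamic allocation is off.
println(sc.getConf.getBoolean("spark.dynamicAllocation.enabled", false))
{code}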



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5966:
---

Assignee: Apache Spark  (was: Andrew Or)

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5966) Spark-submit deploy-mode incorrectly affecting submission when master = local[4]

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5966:
---

Assignee: Andrew Or  (was: Apache Spark)

> Spark-submit deploy-mode incorrectly affecting submission when master = 
> local[4] 
> -
>
> Key: SPARK-5966
> URL: https://issues.apache.org/jira/browse/SPARK-5966
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.0
>Reporter: Tathagata Das
>Assignee: Andrew Or
>Priority: Critical
>
> {code}
> [tdas @ Zion spark] bin/spark-submit --master local[10]  --class 
> test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> JVM arguments: [-XX:MaxPermSize=128m, -Xms1G, -Xmx1G]
> App arguments:
> Usage: MemoryTest  
> [tdas @ Zion spark] bin/spark-submit --master local[10] --deploy-mode cluster 
>  --class test.MemoryTest --driver-memory 1G 
> /Users/tdas/Projects/Others/simple-project/target/scala-2.10/simple-app-assembly-0.1-SNAPSHOT.jar
> java.lang.ClassNotFoundException:
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:538)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-22 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-11258:
-
Description: 
h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic runtime 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns).

h4. Problem

The {{dfToCols}} method transposes the row-wise representation of the Spark 
DataFrame (an array of rows) into a column-wise representation (an array of 
columns) to then be put into a data frame. This is done in a very inefficient 
way, leading to huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row-wise representation to the column-wise 
representation in one pass through the data. I will create a pull request for 
this.
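
For illustration, a single-pass transpose amounts to something like the 
following (a sketch only, not the actual patch):

{code}
import org.apache.spark.sql.Row

// Transpose an Array[Row] (row-wise) into Array[Array[Any]] (column-wise),
// touching every cell exactly once instead of scanning the data once per column.
def rowsToCols(rows: Array[Row], numCols: Int): Array[Array[Any]] = {
  val cols = Array.fill(numCols)(new Array[Any](rows.length))
  var i = 0
  while (i < rows.length) {
    var j = 0
    while (j < numCols) {
      cols(j)(i) = rows(i)(j)
      j += 1
    }
    i += 1
  }
  cols
}
{code}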

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes 2267 ms on average to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger the more columns you have.

  was:
h4. Introduction

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run time 
complexity (it goes through the complete data set _m_ times, where _m_ is the 
number of columns.

h4. Problem

The {{dfToCols}} method is transposing the row-wise representation of the Spark 
DataFrame (array of rows) into a column wise representation (array of columns) 
to then be put into a data frame. This is done in a very inefficient way, 
yielding to huge performance (and possibly also memory) problems when 
collecting bigger data frames.

h4. Solution

Directly transpose the row wise representation to the column wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old `dfToCols` 
method takes average 2267 ms to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger, the more columns you have.


> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic runtime 
> complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns).
> h4. Problem
> The {{dfToCols}} method is transposing the row-wise representation of the 
> Spark DataFrame (array of rows) into a column-wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, leading to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row wise representation to the column wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes average 2267 ms to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger, the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11195) Exception thrown on executor throws ClassNotFound on driver

2015-10-22 Thread Hurshal Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969535#comment-14969535
 ] 

Hurshal Patel commented on SPARK-11195:
---

The only difference between my repro and yours is that I have a fat jar with all 
my classes and you are providing the deps with --jars, but in either case the 
driver doesn't have the correct classpath.

> Exception thrown on executor throws ClassNotFound on driver
> ---
>
> Key: SPARK-11195
> URL: https://issues.apache.org/jira/browse/SPARK-11195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hurshal Patel
>
> I have a minimal repro job
> {code:title=Repro.scala}
> package repro
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkException
> class MyException(message: String) extends Exception(message: String)
> object Repro {
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("MyException ClassNotFound Repro")
> val sc = new SparkContext(conf)
> sc.parallelize(List(1)).map { x =>
>   throw new repro.MyException("this is a failure")
>   true
> }.collect()
>   }
> }
> {code}
> On Spark 1.4.1, I get a task failure with the reason correctly set to 
> MyException.
> On Spark 1.5.1, I _expect_ the same behavior, but instead I get a task 
> failure with an UnknownReason caused by ClassNotFoundException.
>  
> here is the job on vanilla Spark 1.4.1:
> {code:title=spark_1.5.1_log}
> $ ./bin/spark-submit --master local --deploy-mode client --class repro.Repro 
> /home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/19 11:55:20 INFO SparkContext: Running Spark version 1.4.1
> 15/10/19 11:55:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/10/19 11:55:22 WARN Utils: Your hostname, choochootrain resolves to a 
> loopback address: 127.0.1.1; using 10.0.1.97 instead (on interface wlan0)
> 15/10/19 11:55:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/10/19 11:55:22 INFO SecurityManager: Changing view acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: Changing modify acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/10/19 11:55:24 INFO Slf4jLogger: Slf4jLogger started
> 15/10/19 11:55:24 INFO Remoting: Starting remoting
> 15/10/19 11:55:24 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.1.97:46683]
> 15/10/19 11:55:24 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46683.
> 15/10/19 11:55:24 INFO SparkEnv: Registering MapOutputTracker
> 15/10/19 11:55:24 INFO SparkEnv: Registering BlockManagerMaster
> 15/10/19 11:55:24 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/blockmgr-08496143-1d9d-41c8-a581-b6220edf00d5
> 15/10/19 11:55:24 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
> 15/10/19 11:55:25 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/httpd-52c396d2-b47f-45a5-bb76-d10aa864e6d5
> 15/10/19 11:55:25 INFO HttpServer: Starting HTTP Server
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'HTTP file server' 
> on port 47915.
> 15/10/19 11:55:25 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 15/10/19 11:55:25 INFO SparkUI: Started SparkUI at http://10.0.1.97:4040
> 15/10/19 11:55:25 INFO SparkContext: Added JAR 
> file:/home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar at 
> http://10.0.1.97:47915/jars/repro-assembly-0.0.1.jar with timestamp 
> 1445280925969
> 15/10/19 11:55:26 INFO Executor: Starting executor ID driver on host localhost
> 15/10/19 11:55:26 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46569.
> 15/10/19 11:55:26 INFO NettyBlockTransferService: Server created on 46569
> 15/10/19 11:55:26 INFO BlockManagerMaster: Trying to register BlockManager
> 15/10/19 11:55:26 INFO BlockManagerMasterEndpoint: Registering block manager 
> localhost:46569 with 265.4 MB RAM, BlockManagerId(driver, localhost, 46569)
> 15/10/19 11:55:26 INFO BlockManagerMaster: Registered BlockManager
> 15/10/19 11:55:27 INFO SparkContext: Starting job: collect at repro.scala:18
> 15/10/19 11:55:27 INFO DAGScheduler: Got job 0 (collect at repro.scala:18) 
> with 1 output partitions 

[jira] [Created] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-11262:


 Summary: Unit test for gradient, loss layers, memory management 
for multilayer perceptron
 Key: SPARK-11262
 URL: https://issues.apache.org/jira/browse/SPARK-11262
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.5.1
Reporter: Alexander Ulanov
 Fix For: 1.5.1


Multi-layer perceptron requires more rigorous tests and refactoring of layer 
interfaces to accommodate development of new features.
1) Implement unit tests for gradient and loss (a generic gradient-check sketch is 
shown below for illustration).
2) Refactor the internal layer interface to extract the "loss function".
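
The kind of test item 1 refers to can be sketched generically as a 
finite-difference gradient check ({{checkGradient}} is a hypothetical helper for 
illustration, not part of the MLP code):

{code}
// Compare an analytic gradient against a central finite difference.
def checkGradient(loss: Array[Double] => Double,
                  grad: Array[Double] => Array[Double],
                  x: Array[Double],
                  eps: Double = 1e-6,
                  tol: Double = 1e-4): Boolean = {
  val g = grad(x)
  x.indices.forall { i =>
    val xPlus = x.clone();  xPlus(i)  += eps
    val xMinus = x.clone(); xMinus(i) -= eps
    val numeric = (loss(xPlus) - loss(xMinus)) / (2 * eps)
    math.abs(numeric - g(i)) <= tol * math.max(1.0, math.abs(numeric))
  }
}

// Example: f(x) = x . x has gradient 2x.
val ok = checkGradient(x => x.map(v => v * v).sum, x => x.map(2 * _), Array(1.0, -2.0, 3.0))
{code}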



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11245) Upgrade twitter4j to version 4.x

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969527#comment-14969527
 ] 

Apache Spark commented on SPARK-11245:
--

User 'pronix' has created a pull request for this issue:
https://github.com/apache/spark/pull/9221

> Upgrade twitter4j to version 4.x
> 
>
> Key: SPARK-11245
> URL: https://issues.apache.org/jira/browse/SPARK-11245
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
>
> Twitter4J is already on 4.x release
> https://github.com/yusuke/twitter4j/releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11263) lintr Throws Warnings on Commented Code in Documentation

2015-10-22 Thread Sen Fang (JIRA)
Sen Fang created SPARK-11263:


 Summary: lintr Throws Warnings on Commented Code in Documentation
 Key: SPARK-11263
 URL: https://issues.apache.org/jira/browse/SPARK-11263
 Project: Spark
  Issue Type: Task
  Components: SparkR
Reporter: Sen Fang
Priority: Minor


This comes from a discussion in https://github.com/apache/spark/pull/9205

Currently lintr throws many warnings around "style: Commented code should be 
removed."

For example
{code}
R/RDD.R:260:3: style: Commented code should be removed.
# unpersist(rdd) # rdd@@env$isCached == FALSE
  ^~~
R/RDD.R:283:3: style: Commented code should be removed.
# sc <- sparkR.init()
  ^~~
R/RDD.R:284:3: style: Commented code should be removed.
# setCheckpointDir(sc, "checkpoint")
  ^~
{code}

Some of them are legitimate warnings, but most of them are simply code examples 
for functions that are not part of the public API. For example
{code}
# @examples
#\dontrun{
# sc <- sparkR.init()
# rdd <- parallelize(sc, 1:10, 2L)
# cache(rdd)
#}
{code}


One workaround is to convert them back to Roxygen docs but assign {{#' @rdname 
.ignore}}; Roxygen will then skip these functions with the message {{Skipping 
invalid path: .ignore.Rd}}.

That being said, I feel people usually praise/criticize R package documentation 
as being "expert friendly". The convention seems to be to provide as much 
documentation as possible but not to export functions that are unstable or 
developer-only. If users choose to use them, they acknowledge the risk by using 
{{:::}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11078) Ensure spilling tests are actually spilling

2015-10-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11078.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Ensure spilling tests are actually spilling
> ---
>
> Key: SPARK-11078
> URL: https://issues.apache.org/jira/browse/SPARK-11078
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Tests
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.6.0
>
>
> The new unified memory management model in SPARK-10983 uncovered many brittle 
> tests that rely on arbitrary thresholds to detect spilling. Some tests don't 
> even assert that spilling did occur.
> We should go through all the places where we test spilling behavior and 
> correct the tests, a subset of which are definitely incorrect. Potential 
> suspects:
> - UnsafeShuffleSuite
> - ExternalAppendOnlyMapSuite
> - ExternalSorterSuite
> - SQLQuerySuite
> - DistributedSuite
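
One way a suite could assert that spilling really happened, instead of relying 
on thresholds, is to aggregate task metrics with a listener. A sketch under the 
assumption that {{sc}} is the test's SparkContext (not necessarily how the fix 
was implemented):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulate spill metrics from every finished task so the test can assert on them.
var bytesSpilled = 0L
sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) bytesSpilled += m.memoryBytesSpilled + m.diskBytesSpilled
  }
})
// ... run the workload under test ...
assert(bytesSpilled > 0, "expected the workload to spill to memory or disk")
{code}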



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Jeffrey Naisbitt (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeffrey Naisbitt updated SPARK-11264:
-
Description: 
Some GREP_OPTIONS will modify the output of the grep commands that are looking 
for the assembly jars in bin/spark-class.

For example, if the -n option is specified, the grep output will look like: 
{code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
This will not match the regular expressions, and so the jar files will not be 
found.  We could improve the regular expression to handle cases like this and 
trim off extra characters, but it is difficult to know which options may or may 
not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and 
gives the desired output.

By the way, the actual error seen from the commandline was this:
{code}Error: Could not find or load main class 
org.apache.spark.launcher.Main{code}

  was:
Some GREP_OPTIONS will modify the output of the grep commands that are looking 
for the assembly jars in bin/spark-class.

For example, if the -n option is specified, the grep output will look like: 
{code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
This will not match the regular expressions, and so the jar files will not be 
found.  We could improve the regular expression to handle cases like this and 
trim off extra characters, but it is difficult to know which options may or may 
not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and 
gives the desired output.


> ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
> 
>
> Key: SPARK-11264
> URL: https://issues.apache.org/jira/browse/SPARK-11264
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.5.1
>Reporter: Jeffrey Naisbitt
>Priority: Minor
>
> Some GREP_OPTIONS will modify the output of the grep commands that are 
> looking for the assembly jars in bin/spark-class.
> For example, if the -n option is specified, the grep output will look like: 
> {code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
> This will not match the regular expressions, and so the jar files will not be 
> found.  We could improve the regular expression to handle cases like this and 
> trim off extra characters, but it is difficult to know which options may or 
> may not be set.  Unsetting GREP_OPTIONS within the script handles all the 
> cases and gives the desired output.
> By the way, the actual error seen from the commandline was this:
> {code}Error: Could not find or load main class 
> org.apache.spark.launcher.Main{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11195) Exception thrown on executor throws ClassNotFound on driver

2015-10-22 Thread Hurshal Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969535#comment-14969535
 ] 

Hurshal Patel edited comment on SPARK-11195 at 10/22/15 5:47 PM:
-

This is very likely the same issue. The only difference between my repro and 
yours is that I have a fat jar with all my classes and you are providing the 
deps with --jars, but in either case the driver doesn't have the correct 
classpath.


was (Author: choochootrain):
the only difference between my repro and yours is that I have a fatjar with all 
my classes and you are providing the deps with --jars but in either case the 
driver doesn't have the correct classpath.

> Exception thrown on executor throws ClassNotFound on driver
> ---
>
> Key: SPARK-11195
> URL: https://issues.apache.org/jira/browse/SPARK-11195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hurshal Patel
>
> I have a minimal repro job
> {code:title=Repro.scala}
> package repro
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkException
> class MyException(message: String) extends Exception(message: String)
> object Repro {
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("MyException ClassNotFound Repro")
> val sc = new SparkContext(conf)
> sc.parallelize(List(1)).map { x =>
>   throw new repro.MyException("this is a failure")
>   true
> }.collect()
>   }
> }
> {code}
> On Spark 1.4.1, I get a task failure with the reason correctly set to 
> MyException.
> On Spark 1.5.1, I _expect_ the same behavior, but instead I get a task 
> failure with an UnknownReason caused by ClassNotFoundException.
>  
> here is the job on vanilla Spark 1.4.1:
> {code:title=spark_1.5.1_log}
> $ ./bin/spark-submit --master local --deploy-mode client --class repro.Repro 
> /home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/19 11:55:20 INFO SparkContext: Running Spark version 1.4.1
> 15/10/19 11:55:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/10/19 11:55:22 WARN Utils: Your hostname, choochootrain resolves to a 
> loopback address: 127.0.1.1; using 10.0.1.97 instead (on interface wlan0)
> 15/10/19 11:55:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/10/19 11:55:22 INFO SecurityManager: Changing view acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: Changing modify acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/10/19 11:55:24 INFO Slf4jLogger: Slf4jLogger started
> 15/10/19 11:55:24 INFO Remoting: Starting remoting
> 15/10/19 11:55:24 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.1.97:46683]
> 15/10/19 11:55:24 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46683.
> 15/10/19 11:55:24 INFO SparkEnv: Registering MapOutputTracker
> 15/10/19 11:55:24 INFO SparkEnv: Registering BlockManagerMaster
> 15/10/19 11:55:24 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/blockmgr-08496143-1d9d-41c8-a581-b6220edf00d5
> 15/10/19 11:55:24 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
> 15/10/19 11:55:25 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/httpd-52c396d2-b47f-45a5-bb76-d10aa864e6d5
> 15/10/19 11:55:25 INFO HttpServer: Starting HTTP Server
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'HTTP file server' 
> on port 47915.
> 15/10/19 11:55:25 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 15/10/19 11:55:25 INFO SparkUI: Started SparkUI at http://10.0.1.97:4040
> 15/10/19 11:55:25 INFO SparkContext: Added JAR 
> file:/home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar at 
> http://10.0.1.97:47915/jars/repro-assembly-0.0.1.jar with timestamp 
> 1445280925969
> 15/10/19 11:55:26 INFO Executor: Starting executor ID driver on host localhost
> 15/10/19 11:55:26 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46569.
> 15/10/19 11:55:26 INFO NettyBlockTransferService: Server created on 46569
> 15/10/19 11:55:26 INFO BlockManagerMaster: Trying to register BlockManager
> 15/10/19 11:55:26 INFO BlockManagerMasterEndpoint: Registering block manager 
> localhost:46569 with 

[jira] [Commented] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969745#comment-14969745
 ] 

Apache Spark commented on SPARK-11264:
--

User 'naisbitt' has created a pull request for this issue:
https://github.com/apache/spark/pull/9231

> ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
> 
>
> Key: SPARK-11264
> URL: https://issues.apache.org/jira/browse/SPARK-11264
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Jeffrey Naisbitt
>Priority: Minor
>
> Some GREP_OPTIONS will modify the output of the grep commands that are 
> looking for the assembly jars in bin/spark-class.
> For example, if the -n option is specified, the grep output will look like: 
> {code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
> This will not match the regular expressions, and so the jar files will not be 
> found.  We could improve the regular expression to handle cases like this and 
> trim off extra characters, but it is difficult to know which options may or 
> may not be set.  Unsetting GREP_OPTIONS within the script handles all the 
> cases and gives the desired output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11264:


Assignee: (was: Apache Spark)

> ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
> 
>
> Key: SPARK-11264
> URL: https://issues.apache.org/jira/browse/SPARK-11264
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Jeffrey Naisbitt
>Priority: Minor
>
> Some GREP_OPTIONS will modify the output of the grep commands that are 
> looking for the assembly jars in bin/spark-class.
> For example, if the -n option is specified, the grep output will look like: 
> {code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
> This will not match the regular expressions, and so the jar files will not be 
> found.  We could improve the regular expression to handle cases like this and 
> trim off extra characters, but it is difficult to know which options may or 
> may not be set.  Unsetting GREP_OPTIONS within the script handles all the 
> cases and gives the desired output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Jeffrey Naisbitt (JIRA)
Jeffrey Naisbitt created SPARK-11264:


 Summary: ./bin/spark-class can't find assembly jars with certain 
GREP_OPTIONS set
 Key: SPARK-11264
 URL: https://issues.apache.org/jira/browse/SPARK-11264
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Jeffrey Naisbitt
Priority: Minor


Some GREP_OPTIONS will modify the output of the grep commands that are looking 
for the assembly jars in bin/spark-class.

For example, if the -n option is specified, the grep output will look like: 
{code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
This will not match the regular expressions, and so the jar files will not be 
found.  We could improve the regular expression to handle cases like this and 
trim off extra characters, but it is difficult to know which options may or may 
not be set.  Unsetting GREP_OPTIONS within the script handles all the cases and 
gives the desired output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11264) ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11264:


Assignee: Apache Spark

> ./bin/spark-class can't find assembly jars with certain GREP_OPTIONS set
> 
>
> Key: SPARK-11264
> URL: https://issues.apache.org/jira/browse/SPARK-11264
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Jeffrey Naisbitt
>Assignee: Apache Spark
>Priority: Minor
>
> Some GREP_OPTIONS will modify the output of the grep commands that are 
> looking for the assembly jars in bin/spark-class.
> For example, if the -n option is specified, the grep output will look like: 
> {code}5:spark-assembly-1.5.1-hadoop2.4.0.jar{code}
> This will not match the regular expressions, and so the jar files will not be 
> found.  We could improve the regular expression to handle cases like this and 
> trim off extra characters, but it is difficult to know which options may or 
> may not be set.  Unsetting GREP_OPTIONS within the script handles all the 
> cases and gives the desired output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11261) Provide a more flexible alternative to Jdbc RDD

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969595#comment-14969595
 ] 

Apache Spark commented on SPARK-11261:
--

User 'rmarsch' has created a pull request for this issue:
https://github.com/apache/spark/pull/9228

> Provide a more flexible alternative to Jdbc RDD
> ---
>
> Key: SPARK-11261
> URL: https://issues.apache.org/jira/browse/SPARK-11261
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Richard Marscher
>
> The existing JdbcRDD only covers a limited number of use cases by requiring 
> the semantics of your query to operate on upper and lower bound predicates 
> like: "select title, author from books where ? <= id and id <= ?"
> However, there are many use cases that cannot use such a method and/or are 
> much more inefficient doing so.
> For example, we have a MySQL table partitioned on a partition key. We don't 
> have range values to lookup but rather want to get all entries matching a 
> predicate and have Spark run 1 query in a partition against each logical 
> partition of our MySQL table. For example: "select * from devices where 
> partition_id = ? and app_id = 'abcd'".
> Another use case, looking up against a distinct set of identifiers that don't 
> fall within an ordering. "select * from users where user_id in 
> (?,?,?,?,?,?,?)". The number of identifiers may be quite large and/or dynamic.
> Solution:
> Instead of addressing each use case differently with new RDD types, provide 
> an alternate, general RDD that gives the user direct control over how the 
> query is partitioned in Spark and how the placeholders are filled in.
> The user should be able to control which placeholder values are available on 
> each partition of the RDD and also how they are inserted into the 
> PreparedStatement. Ideally it can support dynamic placeholder values like 
> inserting a set of values for an IN clause or similar.
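
For context, the API being generalized looks like this (a sketch using the 
current {{JdbcRDD}} constructor; the connection details are illustrative, and 
{{sc}} and a JDBC driver on the classpath are assumed):

{code}
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// The existing JdbcRDD requires exactly two '?' placeholders bound to a numeric
// lower/upper bound, which is the restriction described above.
val books = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass"),
  "select title, author from books where ? <= id and id <= ?",
  1L, 1000000L, 10, // lowerBound, upperBound, numPartitions
  (rs: ResultSet) => (rs.getString("title"), rs.getString("author")))
{code}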



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11261) Provide a more flexible alternative to Jdbc RDD

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11261:


Assignee: Apache Spark

> Provide a more flexible alternative to Jdbc RDD
> ---
>
> Key: SPARK-11261
> URL: https://issues.apache.org/jira/browse/SPARK-11261
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Richard Marscher
>Assignee: Apache Spark
>
> The existing JdbcRDD only covers a limited number of use cases by requiring 
> the semantics of your query to operate on upper and lower bound predicates 
> like: "select title, author from books where ? <= id and id <= ?"
> However, there are many use cases that cannot use such a method and/or are 
> much more inefficient doing so.
> For example, we have a MySQL table partitioned on a partition key. We don't 
> have range values to lookup but rather want to get all entries matching a 
> predicate and have Spark run 1 query in a partition against each logical 
> partition of our MySQL table. For example: "select * from devices where 
> partition_id = ? and app_id = 'abcd'".
> Another use case, looking up against a distinct set of identifiers that don't 
> fall within an ordering. "select * from users where user_id in 
> (?,?,?,?,?,?,?)". The number of identifiers may be quite large and/or dynamic.
> Solution:
> Instead of addressing each use case differently with new RDD types, provide 
> an alternate, general RDD that gives the user direct control over how the 
> query is partitioned in Spark and filling in the placeholders.
> The user should be able to control which placeholder values are available on 
> each partition of the RDD and also how they are inserted into the 
> PreparedStatement. Ideally it can support dynamic placeholder values like 
> inserting a set of values for an IN clause or similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11261) Provide a more flexible alternative to Jdbc RDD

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11261:


Assignee: (was: Apache Spark)

> Provide a more flexible alternative to Jdbc RDD
> ---
>
> Key: SPARK-11261
> URL: https://issues.apache.org/jira/browse/SPARK-11261
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Richard Marscher
>
> The existing JdbcRDD only covers a limited number of use cases by requiring 
> the semantics of your query to operate on upper and lower bound predicates 
> like: "select title, author from books where ? <= id and id <= ?"
> However, there are many use cases that cannot use such a method and/or are 
> much more inefficient doing so.
> For example, we have a MySQL table partitioned on a partition key. We don't 
> have range values to lookup but rather want to get all entries matching a 
> predicate and have Spark run 1 query in a partition against each logical 
> partition of our MySQL table. For example: "select * from devices where 
> partition_id = ? and app_id = 'abcd'".
> Another use case, looking up against a distinct set of identifiers that don't 
> fall within an ordering. "select * from users where user_id in 
> (?,?,?,?,?,?,?)". The number of identifiers may be quite large and/or dynamic.
> Solution:
> Instead of addressing each use case differently with new RDD types, provide 
> an alternate, general RDD that gives the user direct control over how the 
> query is partitioned in Spark and filling in the placeholders.
> The user should be able to control which placeholder values are available on 
> each partition of the RDD and also how they are inserted into the 
> PreparedStatement. Ideally it can support dynamic placeholder values like 
> inserting a set of values for an IN clause or similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969637#comment-14969637
 ] 

Apache Spark commented on SPARK-11262:
--

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/9229

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
> Fix For: 1.5.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1)Implement unit test for gradient and loss
> 2)Refactor the internal layer interface to extract "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11262:


Assignee: Apache Spark

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>Assignee: Apache Spark
> Fix For: 1.5.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1)Implement unit test for gradient and loss
> 2)Refactor the internal layer interface to extract "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11262:


Assignee: (was: Apache Spark)

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
> Fix For: 1.5.1
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1)Implement unit test for gradient and loss
> 2)Refactor the internal layer interface to extract "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11265) YarnClient cant get tokens to talk to Hive in a secure cluster

2015-10-22 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-11265:
--

 Summary: YarnClient cant get tokens to talk to Hive in a secure 
cluster
 Key: SPARK-11265
 URL: https://issues.apache.org/jira/browse/SPARK-11265
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.5.1
 Environment: Kerberized Hadoop cluster
Reporter: Steve Loughran


As reported on the dev list, trying to run a YARN client which wants to talk to 
Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was made 
private and replaced with a factory method. The YARN client uses reflection to 
get the tokens, so the signature changes weren't picked up in SPARK-8064.
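
The reflection pattern at issue looks roughly like this (illustrative only; 
{{hiveClientViaReflection}} is a hypothetical helper, the Hive classes must be on 
the classpath, and exact signatures vary by Hive version):

{code}
// Obtain a Hive client object reflectively, as the YARN client does when
// fetching delegation tokens.
def hiveClientViaReflection(hiveConf: AnyRef): AnyRef = {
  val hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
  val hiveConfClass = Class.forName("org.apache.hadoop.hive.conf.HiveConf")
  // Old approach (public constructor), which breaks once the constructor is private:
  //   hiveClass.getConstructor(hiveConfClass).newInstance(hiveConf)
  // Newer Hive versions expose a static factory method instead:
  hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf)
}
{code}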



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11200.
---
Resolution: Cannot Reproduce

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: hujiayin
>
> Endless messages of the form "cannot send ${message} because RpcEnv is closed" 
> pop up after starting any of the MLlib workloads and continue until stopped 
> manually. The environment is hadoop-cdh-5.3.2 with a Spark master-branch build 
> running in yarn-client mode. The error comes from NettyRpcEnv.scala. I don't 
> have enough time to look into this issue right now, but I can verify the issue 
> in my environment if you have a fix. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11242) In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11242:
--
Priority: Trivial  (was: Minor)

> In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly
> ---
>
> Key: SPARK-11242
> URL: https://issues.apache.org/jira/browse/SPARK-11242
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.1
>Reporter: Xiu
>Priority: Trivial
>
> In conf/spark-env.sh.template
> https://github.com/apache/spark/blob/master/conf/spark-env.sh.template#L42
> # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 1G)
> SPARK_DRIVER_MEMORY is the memory setting for the driver, not the master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11162.
---
Resolution: Not A Problem

> Allow enabling debug logging from the command line
> --
>
> Key: SPARK-11162
> URL: https://issues.apache.org/jira/browse/SPARK-11162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> Per [~vanzin] on [the user 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
>  it would be nice if debug-logging could be enabled from the command line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11181.
---
Resolution: Duplicate

I pretty strongly doubt there will be another 1.3.x release, [~prakhar088], so I 
would look at upgrading instead.

> Spark Yarn : Spark reducing total executors count even when Dynamic 
> Allocation is disabled.
> ---
>
> Key: SPARK-11181
> URL: https://issues.apache.org/jira/browse/SPARK-11181
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core, YARN
>Affects Versions: 1.3.1
> Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. 
> All servers in cluster running Linux version 2.6.32. 
> Job in yarn-client mode.
>Reporter: prakhar jauhari
>
> The Spark driver reduces the total executor count even when dynamic allocation 
> is not enabled.
> To reproduce this:
> 1. A 2-node YARN setup: each DN has ~20 GB of memory and 4 cores.
> 2. When the application launches and gets its required executors, one of the 
> DNs loses connectivity and is timed out.
> 3. Spark issues a killExecutor for the executor on the DN which was timed 
> out. 
> 4. Even with dynamic allocation off, Spark's scheduler reduces 
> "targetNumExecutors".
> 5. Thus the job runs with a reduced executor count.
> Note: the severity of the issue increases if some of the DNs that were 
> running my job's executors lose connectivity intermittently; the Spark 
> scheduler keeps reducing "targetNumExecutors", thus not asking for new 
> executors on any other nodes and causing the job to hang.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Lokesh Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Kumar closed SPARK-11257.

Resolution: Fixed

I came to know that the '!' unary operator can be replaced with NOT, and it works fine.
Hence closing the issue.
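
For reference, the working form is (assuming a DataFrame {{df}} with a 
{{Ship Mode}} column, as in the report below; sketch only):

{code}
// NOT parses where the '!' prefix does not.
val filtered = df.filter("NOT (`Ship Mode` LIKE '%Truck%')")
{code}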

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
> Fix For: 1.5.0
>
>
> I am trying to apply a negated filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negated filter condition works fine in 
> MySQL. Please see below:
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in a future release?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11179:
--
Assignee: Nitin Goyal

> Push filters through aggregate if filters are subset of 'group by' expressions
> --
>
> Key: SPARK-11179
> URL: https://issues.apache.org/jira/browse/SPARK-11179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nitin Goyal
>Assignee: Nitin Goyal
>Priority: Minor
> Fix For: 1.6.0
>
>
> Push filters through an aggregate if the filters are a subset of the 'group by' 
> expressions. This optimisation can be added in Spark SQL's Optimizer class.
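
For illustration, the intended rewrite amounts to the following (a conceptual 
sketch assuming a DataFrame {{emp}} with a {{dept}} column; not the actual 
Optimizer rule):

{code}
import org.apache.spark.sql.functions.col

// Filter applied after the aggregate...
val before = emp.groupBy("dept").count().filter(col("dept") === "sales")
// ...can be evaluated with the filter pushed below it, because `dept` is a
// grouping column, so the result is the same but far fewer rows are aggregated.
val after  = emp.filter(col("dept") === "sales").groupBy("dept").count()
{code}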



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10463) remove PromotePrecision during optimization

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10463:
--
Assignee: Adrian Wang

> remove PromotePrecision during optimization
> ---
>
> Key: SPARK-10463
> URL: https://issues.apache.org/jira/browse/SPARK-10463
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Adrian Wang
>Assignee: Adrian Wang
>Priority: Trivial
> Fix For: 1.6.0
>
>
> This node is not necessary after HiveTypeCoercion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11102) Uninformative exception when specifing non-exist input for JSON data source

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969019#comment-14969019
 ] 

Apache Spark commented on SPARK-11102:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9223

> Uninformative exception when specifing non-exist input for JSON data source
> ---
>
> Key: SPARK-11102
> URL: https://issues.apache.org/jira/browse/SPARK-11102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> If I specify a non-existent input path for the JSON data source, the following 
> exception is thrown; it is not readable. 
> {code}
> 15/10/14 16:14:39 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
> in memory (estimated size 19.9 KB, free 251.4 KB)
> 15/10/14 16:14:39 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
> on 192.168.3.3:54725 (size: 19.9 KB, free: 2.2 GB)
> 15/10/14 16:14:39 INFO SparkContext: Created broadcast 0 from json at 
> <console>:19
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:201)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1087)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1085)
>   at 
> org.apache.spark.sql.execution.datasources.json.InferSchema$.apply(InferSchema.scala:58)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$6.apply(JSONRelation.scala:100)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema$lzycompute(JSONRelation.scala:100)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation.dataSchema(JSONRelation.scala:99)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
>   at 
> org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:122)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:106)
>   at org.apache.spark.sql.DataFrameReader.json(DataFrameReader.scala:221)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:19)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:24)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:26)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.(:28)
>   at $iwC$$iwC$$iwC$$iwC.(:30)
>   at $iwC$$iwC$$iwC.(:32)
>   at $iwC$$iwC.(:34)
>   at $iwC.(:36)
> {code}
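
A minimal sketch of the reported scenario (the path below is a made-up placeholder that
does not exist; {{sqlContext}} is assumed to be an existing SQLContext, as in spark-shell):

{code}
// Schema inference kicks off a job over the input and surfaces the generic
// "No input paths specified in job" IOException above, instead of a clear
// "input path does not exist" style message.
val df = sqlContext.read.json("/path/that/does/not/exist")
{code}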



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11227.
---
Resolution: Not A Problem

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>
> When running a jar containing a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws an exception, "java.net.UnknownHostException: 
> nameservice1", and fails.
> I run the following in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then fails with the message below:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But when I switched the cluster from Spark 1.5.1 to Spark 1.4.0 and reran the 
> job, it completed successfully.
> In addition, when I disabled High Availability on HDFS and reran the job, it also 
> completed successfully.
> So, I think Spark 1.5 and higher have a bug as 

[jira] [Closed] (SPARK-11213) Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-11213.
-

> Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4
> ---
>
> Key: SPARK-11213
> URL: https://issues.apache.org/jira/browse/SPARK-11213
> Project: Spark
>  Issue Type: Bug
>Reporter: Ankit
>
> Hello guys,
> We have a Cloudera distribution 5.4 and it ships with Spark 1.3.
> Issue:
> Our data scientists work on R scripts, so I was searching for a way to submit an 
> R script, using Oozie or a local spark-submit, to a remote YARN resource manager. 
> Can anyone share the steps to do this? It is really difficult to guess the 
> steps.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11248) Spark hivethriftserver is using the wrong user while getting HDFS permissions

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11248:
--
Component/s: SQL

> Spark hivethriftserver is using the wrong user while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Trystan Leftwich
>
> While running Spark as a HiveThrift server via YARN, Spark will use the user 
> running the HiveThrift server, rather than the user connecting via JDBC, to 
> check HDFS permissions.
> i.e.
> In HDFS the permissions are:
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And I connect via beeline as user testuser:
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If I try to hit that table:
> select count(*) from test_table;
> I get the following error:
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml, so it should be using the 
> correct user:
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> This works correctly in Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11245) Upgrade twitter4j to version 4.x

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11245:
--
Fix Version/s: (was: 1.6.0)

[~luciano resende] don't set Fix version or Target version.

> Upgrade twitter4j to version 4.x
> 
>
> Key: SPARK-11245
> URL: https://issues.apache.org/jira/browse/SPARK-11245
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
>
> Twitter4J already has a 4.x release:
> https://github.com/yusuke/twitter4j/releases
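
For example, the build change would be roughly of this shape (the version shown is only
an example from the 4.x line, not a vetted choice, and the artifact name assumes the
streaming-twitter module keeps depending on twitter4j-stream):

{code}
// sbt: bump the Twitter dependency from the 3.x line to 4.x
libraryDependencies += "org.twitter4j" % "twitter4j-stream" % "4.0.4"
{code}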



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11252) ExternalShuffleClient should release its connection after it has finished fetching blocks from YARN's NodeManager

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11252:
--
Component/s: YARN
 Shuffle

> ExternalShuffleClient should release its connection after it has finished 
> fetching blocks from YARN's NodeManager
> --
>
> Key: SPARK-11252
> URL: https://issues.apache.org/jira/browse/SPARK-11252
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Reporter: Lianhui Wang
>
> The ExternalShuffleClient of each executor keeps its connection to YARN's 
> NodeManager open until the application has completed, so the NodeManager ends up 
> holding many socket connections.
> In order to reduce the network pressure on the NodeManager's shuffle service, the 
> connection to the shuffle service should be closed once registerWithShuffleServer 
> or fetchBlocks has completed in ExternalShuffleClient.
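
A purely illustrative sketch of the proposed pattern (not the actual patch; the helper
name is made up and the real change would live inside ExternalShuffleClient itself):
treat the connection as a short-lived resource that is released right after the
registration/fetch round-trip instead of being held for the whole application.

{code}
import java.io.Closeable

// Loan pattern: whatever client we borrow is closed immediately after use.
def withClient[C <: Closeable, T](client: C)(body: C => T): T =
  try body(client) finally client.close()
{code}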



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Lokesh Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Kumar closed SPARK-11257.

Resolution: Not A Problem

There was no fix. See earlier comments.

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases
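
For reference, a small sketch of negation that the DataFrame API does accept today
(assuming {{df}} is the DataFrame from the report; it is only the expression-string
parser that rejects the leading {{!}}):

{code}
import org.apache.spark.sql.functions.not

// Negate via the Column API instead of an expression string:
val viaOperator = df.filter(!df("Ship Mode").like("%Truck%"))
val viaFunction = df.filter(not(df("Ship Mode").like("%Truck%")))
{code}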



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2015-10-22 Thread Yuri Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969002#comment-14969002
 ] 

Yuri Saito commented on SPARK-11227:


[~ste...@apache.org]
But in the same environment, Spark 1.4.0 runs successfully.

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>
> When running a jar containing a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws an exception, "java.net.UnknownHostException: 
> nameservice1", and fails.
> I run the following in a terminal:
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then fails with the message below:
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But when I switched the cluster from Spark 1.5.1 to Spark 1.4.0 and reran the 
> job, it completed successfully.
> In addition, I disabled High Availability on HDFS, then 

[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2015-10-22 Thread HAN YE (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969011#comment-14969011
 ] 

HAN YE commented on SPARK-5594:
---

I don't know what your code is, but if it runs one StreamingContext and 
one SparkContext simultaneously, you should use the StreamingContext's 
SparkContext rather than creating another new SparkContext. Otherwise it can cause 
this problem.
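
A minimal sketch of that suggestion (app name and batch interval are arbitrary): create
the StreamingContext once and reuse its underlying SparkContext instead of constructing
a second one.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("single-context-example")
val ssc  = new StreamingContext(conf, Seconds(10)) // creates the one and only SparkContext
val sc: SparkContext = ssc.sparkContext            // reuse it; do not call new SparkContext(conf)
{code}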

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug, however I am getting the error below 
> when running on a cluster (works locally), and have no idea what is causing 
> it, or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable, all 
> my other spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at 
> 

[jira] [Resolved] (SPARK-11121) Incorrect TaskLocation type

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11121.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9096
[https://github.com/apache/spark/pull/9096]

> Incorrect TaskLocation type
> ---
>
> Key: SPARK-11121
> URL: https://issues.apache.org/jira/browse/SPARK-11121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhichao-li
>Priority: Minor
> Fix For: 1.6.0
>
>
> "toString" is the only difference between HostTaskLocation and 
> HDFSCacheTaskLocation for the moment, but it would be better to correct this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11121) Incorrect TaskLocation type

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11121:
--
Assignee: zhichao-li

> Incorrect TaskLocation type
> ---
>
> Key: SPARK-11121
> URL: https://issues.apache.org/jira/browse/SPARK-11121
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhichao-li
>Assignee: zhichao-li
>Priority: Minor
> Fix For: 1.6.0
>
>
> "toString" is the only difference between HostTaskLocation and 
> HDFSCacheTaskLocation for the moment, but it would be better to correct this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11208:
--
Assignee: Artem Aliev

> Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config
> -
>
> Key: SPARK-11208
> URL: https://issues.apache.org/jira/browse/SPARK-11208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Artem Aliev
>Assignee: Artem Aliev
> Fix For: 1.6.0
>
>
> Spark use two hive meta stores: external one for storing tables and internal 
> one (executionHive):
> {code}
> /**
> The copy of the hive client that is used for execution. Currently this must 
> always be
> Hive 13 as this is the version of Hive that is packaged with Spark SQL. This 
> copy of the
> client is used for execution related tasks like registering temporary 
> functions or ensuring
> that the ThreadLocal SessionState is correctly populated. This copy of Hive 
> is not used
> for storing persistent metadata, and only point to a dummy metastore in a 
> temporary directory. */
> {code}
> The executionHive is assumed to be a standard metastore located in a temporary 
> directory as a Derby DB. But hive.metastore.rawstore.impl was not filtered 
> out, so any custom implementation of the metastore with other storage 
> properties (not JDO) will persist those temporary functions. 
> CassandraMetaStore from DataStax Enterprise is one example.
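
An illustrative sketch only (not the actual Spark patch; the map contents are made up):
the fix amounts to dropping the metastore-implementation override when assembling the
temporary configuration for the execution Hive client.

{code}
// Hypothetical user-supplied Hive settings:
val userHiveConf: Map[String, String] = Map(
  "hive.metastore.rawstore.impl" -> "com.example.CustomRawStore", // must not leak into executionHive
  "hive.exec.scratchdir"         -> "/tmp/hive"
)

// Filter the key out before handing the config to the temporary execution metastore:
val executionHiveConf = userHiveConf - "hive.metastore.rawstore.impl"
{code}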



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11257:
---

[~lokeshdotp] don't resolve these as "Fixed" since there was not a problem that 
was fixed by a particular change.

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
> Fix For: 1.5.0
>
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11216) add encoder/decoder for external row

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11216:
--
Assignee: Wenchen Fan

> add encoder/decoder for external row
> 
>
> Key: SPARK-11216
> URL: https://issues.apache.org/jira/browse/SPARK-11216
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11257:
--
Target Version/s:   (was: 1.5.1)
   Fix Version/s: (was: 1.5.0)

[~lokeshdotp] also do not set Target/Fix version. 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11257.
---
Resolution: Not A Problem

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Lokesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968976#comment-14968976
 ] 

Lokesh Kumar commented on SPARK-11257:
--

Sorry, I'm using JIRA for the first time.

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11257) Spark dataframe negate filter conditions

2015-10-22 Thread Lokesh Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lokesh Kumar reopened SPARK-11257:
--

> Spark dataframe negate filter conditions
> 
>
> Key: SPARK-11257
> URL: https://issues.apache.org/jira/browse/SPARK-11257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: Fedora 21 core i5
>Reporter: Lokesh Kumar
>  Labels: bug
>
> I am trying to apply a negation of filter condition on the DataFrame as shown 
> below.
> !(`Ship Mode` LIKE '%Truck%')
> Which is throwing an exception below
> Exception in thread "main" java.lang.RuntimeException: [1.3] failure: 
> identifier expected
> (!(`Ship Mode` LIKE '%Truck%'))
>   ^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.SqlParser.parseExpression(SqlParser.scala:47)
> at org.apache.spark.sql.DataFrame.filter(DataFrame.scala:748)
> at Main.main(Main.java:73)
> Whereas the same kind of negative filter conditions are working fine in 
> MySQL. Please find below
> mysql> select count(*) from audit_log where !(operation like '%Log%' or 
> operation like '%Proj%');
> +--+
> | count(*) |
> +--+
> |  129 |
> +--+
> 1 row in set (0.05 sec)
> Can anyone please let me know if this is planned to be fixed in Spark 
> DataFrames in future releases



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11247) Back to master always tries to use FQDN node name

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11247:
--
Fix Version/s: (was: 1.6.0)

> Back to master always tries to use FQDN node name
> ---
>
> Key: SPARK-11247
> URL: https://issues.apache.org/jira/browse/SPARK-11247
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
>
> In a standalone Spark deployment:
> 1. Access the Spark master using the IP address of the machine (e.g. DNS is not set up)
> 2. Click on a given worker
> 3. Click back to master
> Result: 404
> The UI always tries to resolve the FQDN of the node instead of using the 
> provided IP, which gives a 404 because there is no DNS enabled for the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11247) Back to master always tries to use FQDN node name

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11247.
---
Resolution: Not A Problem

> Back to master always tries to use FQDN node name
> ---
>
> Key: SPARK-11247
> URL: https://issues.apache.org/jira/browse/SPARK-11247
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Luciano Resende
> Fix For: 1.6.0
>
>
> In a standalone Spark deployment:
> 1. Access the Spark master using the IP address of the machine (e.g. DNS is not set up)
> 2. Click on a given worker
> 3. Click back to master
> Result: 404
> The UI always tries to resolve the FQDN of the node instead of using the 
> provided IP, which gives a 404 because there is no DNS enabled for the node.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8048) Explicit partitioning of an RDD with 0 partitions will yield an empty outer join

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8048.
--
Resolution: Duplicate

> Explicit partitioning of an RDD with 0 partitions will yield an empty outer join
> -
>
> Key: SPARK-8048
> URL: https://issues.apache.org/jira/browse/SPARK-8048
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1
>Reporter: Olivier Toupin
>Priority: Minor
>
> Check this code =>
> https://gist.github.com/anonymous/0f935915f2bc182841f0
> Because of this => {{.partitionBy(new HashPartitioner(0))}}
> the join will return an empty result.
> The expected behaviour here would be for the join to crash, raise an error, or 
> return unjoined results, but instead it yields an empty RDD.
> This is a trivial example, but imagine: 
> {{.partitionBy(new HashPartitioner(previous.partitions.length))}}. 
> If you join against an empty "previous" RDD, the lookup table is empty and you 
> lose all your results instead of getting unjoined results back, and this happens 
> without warnings or errors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11250) Generate different alias for columns with same name during join

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11250:
--
Component/s: SQL

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with the same name after joining, and hard to 
> access them; we could generate different aliases for them in the joined DataFrame.
> See https://github.com/apache/spark/pull/9012/files#r42696855 as an example.
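
A sketch of a common workaround today (the DataFrames, aliases and column names below
are illustrative): alias each side before the join so the duplicated columns can be
qualified afterwards.

{code}
import org.apache.spark.sql.functions.col

// `left` and `right` are hypothetical DataFrames that both have `id` and `name` columns.
val joined = left.as("l").join(right.as("r"), col("l.id") === col("r.id"))
val picked = joined.select(col("l.name"), col("r.name"))
{code}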



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11213) Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11213.
---
Resolution: Invalid

[~ankit30988] do *not* reopen an issue unless something has materially changed 
in the conversation. Here you are still asking questions -- please use 
u...@spark.apache.org. 

> Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4
> ---
>
> Key: SPARK-11213
> URL: https://issues.apache.org/jira/browse/SPARK-11213
> Project: Spark
>  Issue Type: Bug
>Reporter: Ankit
>
> Hello guys,
> We have a Cloudera distribution 5.4 and it ships with Spark 1.3.
> Issue:
> Our data scientists work on R scripts, so I was searching for a way to submit an 
> R script, using Oozie or a local spark-submit, to a remote YARN resource manager. 
> Can anyone share the steps to do this? It is really difficult to guess the 
> steps.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11213) Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4

2015-10-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968969#comment-14968969
 ] 

Sean Owen edited comment on SPARK-11213 at 10/22/15 11:15 AM:
--

[~AnkitKhare] do *not* reopen an issue unless something has materially changed 
in the conversation. Here you are still asking questions -- please use 
u...@spark.apache.org. 


was (Author: srowen):
[~ankit30988] do *not* reopen an issue unless something has materially changed 
in the conversation. Here you are still asking questions -- please use 
u...@spark.apache.org. 

> Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4
> ---
>
> Key: SPARK-11213
> URL: https://issues.apache.org/jira/browse/SPARK-11213
> Project: Spark
>  Issue Type: Bug
>Reporter: Ankit
>
> Hello guys,
> We have a Cloudera distribution 5.4 and it ships with Spark 1.3.
> Issue:
> Our data scientists work on R scripts, so I was searching for a way to submit an 
> R script, using Oozie or a local spark-submit, to a remote YARN resource manager. 
> Can anyone share the steps to do this? It is really difficult to guess the 
> steps.
> Thanks in advance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9735) Auto infer partition schema of HadoopFsRelation should respect the user specified one

2015-10-22 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9735.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8026
[https://github.com/apache/spark/pull/8026]

> Auto infer partition schema of HadoopFsRelation should respect the 
> user specified one
> --
>
> Key: SPARK-9735
> URL: https://issues.apache.org/jira/browse/SPARK-9735
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> This code is copied from the hadoopFsRelationSuite.scala
> {code}
> partitionedTestDF = (for {
> i <- 1 to 3
> p2 <- Seq("foo", "bar")
>   } yield (i, s"val_$i", 1, p2)).toDF("a", "b", "p1", "p2")
> withTempPath { file =>
>   val input = partitionedTestDF.select('a, 'b, 
> 'p1.cast(StringType).as('ps), 'p2)
>   input
> .write
> .format(dataSourceName)
> .mode(SaveMode.Overwrite)
> .partitionBy("ps", "p2")
> .saveAsTable("t")
>   input
> .write
> .format(dataSourceName)
> .mode(SaveMode.Append)
> .partitionBy("ps", "p2")
> .saveAsTable("t")
>   val realData = input.collect()
>   withTempTable("t") {
> checkAnswer(sqlContext.table("t"), realData ++ realData)
>   }
> }
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
>   at 
> org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 
> in stage 3.0 (TID 206)
> java.lang.ClassCastException: java.lang.Integer cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
>   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
>   at 
> org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
>   at 
> 

[jira] [Reopened] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-11229:
--

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
> Fix For: 1.6.0
>
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
>   at 
> 

[jira] [Closed] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-22 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust closed SPARK-11229.

   Resolution: Fixed
Fix Version/s: 1.6.0

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
> Fix For: 1.6.0
>
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at 
> 
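
A minimal Scala sketch of the reproduction steps quoted above (not part of the 
original report): it assumes a Spark 1.5.x build and an existing parquet file at 
the hypothetical path below. On affected builds the final count() fails with the 
NPE shown in the stack trace; on 1.6.0, or with a non-zero memoryFraction, it 
should return normally.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MemoryFractionZeroRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("SPARK-11229-repro")
      .set("spark.shuffle.memoryFraction", "0") // step 1: the setting under suspicion
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // step 2: load a DataFrame from a parquet file (hypothetical path)
    val df = sqlContext.read.parquet("/tmp/example.parquet")
    df.show()   // step 3: rows are read and printed correctly
    df.count()  // step 4: NPE in JoinedRow.isNullAt on affected builds
    sc.stop()
  }
}
{code}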

[jira] [Resolved] (SPARK-11116) Initial API Draft

2015-10-22 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11116.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Initial API Draft
> -
>
> Key: SPARK-11116
> URL: https://issues.apache.org/jira/browse/SPARK-11116
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.6.0
>
>
> The goal here is to spec out the main functions to give people an idea of 
> what using the API would be like.  Optimization and whatnot can be done in a 
> follow up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11269) Java API support & test cases

2015-10-22 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11269:
---

 Summary: Java API support & test cases
 Key: SPARK-11269
 URL: https://issues.apache.org/jira/browse/SPARK-11269
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11267) NettyRpcEnv and sparkDriver services report the same port in the logs

2015-10-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969923#comment-14969923
 ] 

Sean Owen commented on SPARK-11267:
---

Huh, that does look weird. [~zsxwing], could this have to do with the changes to 
the driver RPC service, i.e. are they somehow the same service? I also don't see 
how two separate services could listen on the same port.

> NettyRpcEnv and sparkDriver services report the same port in the logs
> -
>
> Key: SPARK-11267
> URL: https://issues.apache.org/jira/browse/SPARK-11267
> Project: Spark
>  Issue Type: Bug
> Environment: the version built from today's sources - Spark version 
> 1.6.0-SNAPSHOT
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When starting {{./bin/spark-shell --conf spark.driver.port=}} Spark 
> reports two services - NettyRpcEnv and sparkDriver - using the same {{}} 
> port:
> {code}
> 15/10/22 23:09:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
> 15/10/22 23:09:32 INFO SparkContext: Spark configuration:
> spark.app.name=Spark shell
> spark.driver.port=
> spark.home=/Users/jacek/dev/oss/spark
> spark.jars=
> spark.logConf=true
> spark.master=local[*]
> spark.repl.class.uri=http://192.168.1.4:52645
> spark.submit.deployMode=client
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'NettyRpcEnv' on 
> port .
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'sparkDriver' on 
> port .
> {code}
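
A minimal sketch, not from the ticket, of how one might confirm that both log 
lines refer to a single port: fix spark.driver.port explicitly and read the 
effective value back from the SparkConf. The port value 5555 is an arbitrary 
choice for illustration.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object DriverPortCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("driver-port-check")
      .set("spark.driver.port", "5555") // arbitrary fixed port for the test
    val sc = new SparkContext(conf)
    // Both the "NettyRpcEnv" and "sparkDriver" startup log lines should mention this value.
    println(s"spark.driver.port = ${sc.getConf.get("spark.driver.port")}")
    sc.stop()
  }
}
{code}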



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-22 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11265:
---
Summary: YarnClient can't get tokens to talk to Hive in a secure cluster  
(was: YarnClient cant get tokens to talk to Hive in a secure cluster)

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11268) Non-daemon startup scripts

2015-10-22 Thread Simon Hafner (JIRA)
Simon Hafner created SPARK-11268:


 Summary: Non-daemon startup scripts
 Key: SPARK-11268
 URL: https://issues.apache.org/jira/browse/SPARK-11268
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Simon Hafner


The current submit scripts fork and write the logs to /var/log/spark. It would 
be nice to have an option to just exec the process and log to stdout, so 
systemd can collect the logs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7021) JUnit output for Python tests

2015-10-22 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-7021.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8323
[https://github.com/apache/spark/pull/8323]

> JUnit output for Python tests
> -
>
> Key: SPARK-7021
> URL: https://issues.apache.org/jira/browse/SPARK-7021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Brennon York
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> Currently python returns its test output in its own format. What would be 
> preferred is if the Python test runner could output its test results in JUnit 
> format to better match the rest of the Jenkins test output.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10812) Spark Hadoop Util does not support stopping a non-yarn Spark Context & starting a Yarn spark context.

2015-10-22 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-10812:
---
Fix Version/s: 1.5.2

Backported to branch-1.5 (clean merge) to fix SPARK-11201.

> Spark Hadoop Util does not support stopping a non-yarn Spark Context & 
> starting a Yarn spark context.
> -
>
> Key: SPARK-10812
> URL: https://issues.apache.org/jira/browse/SPARK-10812
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: holdenk
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> While this is likely not a huge issue for real production systems, it can be a 
> problem for test systems that set up a Spark Context, tear it down, and then 
> stand up a Spark Context with a different master (e.g. some tests in local mode 
> and some in yarn mode); a minimal sketch of the scenario follows the quoted 
> failure below. Discovered during work on spark-testing-base on Spark 1.4.1, but 
> it seems the logic that triggers it is present in master (see the 
> SparkHadoopUtil object). A valid workaround for users encountering this issue 
> is to fork a separate JVM, however that can be heavyweight.
> {quote}
> [info] SampleMiniClusterTest:
> [info] Exception encountered when attempting to run a suite with class name: 
> com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
> [info]   java.lang.ClassCastException: 
> org.apache.spark.deploy.SparkHadoopUtil cannot be cast to 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
> [info]   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
> [info]   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
> [info]   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
> [info]   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
> [info]   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
> [info]   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
> [info]   at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
> [info]   at 
> com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
> [info]   at 
> com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
> [info]   at 
> com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
> {quote}
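
A minimal sketch of the scenario described above, assuming a YARN mini-cluster is 
reachable for the second context; the object name, app names, and master URLs are 
illustrative, not the spark-testing-base code itself.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object ContextSwitchScenario {
  def main(args: Array[String]): Unit = {
    // Phase 1: a non-YARN context is created and torn down in this JVM.
    val local = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("phase-1-local"))
    local.stop()

    // Phase 2: a YARN context in the same JVM. On affected versions this can fail
    // with a ClassCastException, because the cached SparkHadoopUtil instance was
    // created for the non-YARN mode.
    val yarn = new SparkContext(
      new SparkConf().setMaster("yarn-client").setAppName("phase-2-yarn"))
    yarn.stop()
  }
}
{code}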



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10024) Python API RF and GBT related params clear up

2015-10-22 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10024:


Assignee: Apache Spark

> Python API RF and GBT related params clear up
> -
>
> Key: SPARK-10024
> URL: https://issues.apache.org/jira/browse/SPARK-10024
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> Implement "RandomForestParams", "GBTParams" and "TreeEnsembleParams" for 
> Python API, and make corresponding parameters in place. There are lots of 
> duplicated code in the current implementation. You can refer the Scala API 
> which is more compact. 
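
As an illustration of the trait-based factoring the ticket asks for (hypothetical 
names, not the real Spark ML traits), parameters shared by random forests and 
GBTs can live in common traits so each estimator stops redeclaring them:

{code}
// Hypothetical sketch of the shared-params pattern; names do not match Spark's internals.
trait HasMaxDepth { def maxDepth: Int }
trait HasSubsamplingRate { def subsamplingRate: Double }

trait TreeEnsembleLikeParams extends HasMaxDepth with HasSubsamplingRate
trait RandomForestLikeParams extends TreeEnsembleLikeParams { def numTrees: Int }
trait GBTLikeParams extends TreeEnsembleLikeParams { def maxIter: Int }

// Concrete configurations then only add their own parameters.
case class RandomForestConfig(
    maxDepth: Int = 5,
    subsamplingRate: Double = 1.0,
    numTrees: Int = 20) extends RandomForestLikeParams
{code}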



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10024) Python API RF and GBT related params clear up

2015-10-22 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969945#comment-14969945
 ] 

Apache Spark commented on SPARK-10024:
--

User 'vectorijk' has created a pull request for this issue:
https://github.com/apache/spark/pull/9233

> Python API RF and GBT related params clear up
> -
>
> Key: SPARK-10024
> URL: https://issues.apache.org/jira/browse/SPARK-10024
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>
> Implement "RandomForestParams", "GBTParams" and "TreeEnsembleParams" for 
> Python API, and make corresponding parameters in place. There are lots of 
> duplicated code in the current implementation. You can refer the Scala API 
> which is more compact. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11265) YarnClient cant get tokens to talk to Hive in a secure cluster

2015-10-22 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969818#comment-14969818
 ] 

Steve Loughran commented on SPARK-11265:


Initial report from Chester Chen:

{noformat}

  This is tested against the 

   spark 1.5.1 ( branch 1.5  with label 1.5.2-SNAPSHOT with commit on Tue Oct 
6, 84f510c4fa06e43bd35e2dc8e1008d0590cbe266)  

   Spark deployment mode : Spark-Cluster

   Notice that if we enable Kerberos mode, the spark yarn client fails with the 
following: 

Could not initialize class org.apache.hadoop.hive.ql.metadata.Hive
java.lang.NoClassDefFoundError: Could not initialize class 
org.apache.hadoop.hive.ql.metadata.Hive
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.Client$.org$apache$spark$deploy$yarn$Client$$obtainTokenForHiveMetastore(Client.scala:1252)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:271)
at 
org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
at 
org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)


Diving into the Yarn Client.scala code and testing against different dependencies, 
I noticed the following: if Kerberos mode is enabled, 
Client.obtainTokenForHiveMetastore() will try to use Scala reflection to get 
Hive and HiveConf and the methods on them. 
 
  val hiveClass =
    mirror.classLoader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
  val hive = hiveClass.getMethod("get").invoke(null)

  val hiveConf = hiveClass.getMethod("getConf").invoke(hive)
  val hiveConfClass =
    mirror.classLoader.loadClass("org.apache.hadoop.hive.conf.HiveConf")

  val hiveConfGet = (param: String) => Option(hiveConfClass
    .getMethod("get", classOf[java.lang.String])
    .invoke(hiveConf, param))

   If the "org.spark-project.hive" % "hive-exec" % "1.2.1.spark" is used, then 
you will get above exception. But if we use the 
   "org.apache.hive" % "hive-exec" "0.13.1-cdh5.2.0" 
 The above method will not throw exception. 
{noformat}
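
For illustration only, and not the actual SPARK-11265 fix: a more defensive 
variant of the reflective lookup quoted above that treats linkage problems (such 
as the Hive constructor becoming private or a missing transitive class) as "no 
Hive tokens available" rather than failing the whole YARN submission. The object 
name, method name, and parameters are hypothetical.

{code}
import java.lang.reflect.InvocationTargetException

object HiveTokenProbe {
  // `loader` stands in for mirror.classLoader in the snippet above.
  def tryGetHiveConfValue(loader: ClassLoader, key: String): Option[String] =
    try {
      val hiveClass = loader.loadClass("org.apache.hadoop.hive.ql.metadata.Hive")
      val hive = hiveClass.getMethod("get").invoke(null) // factory method, not the constructor
      val hiveConf = hiveClass.getMethod("getConf").invoke(hive)
      val hiveConfClass = loader.loadClass("org.apache.hadoop.hive.conf.HiveConf")
      Option(hiveConfClass
        .getMethod("get", classOf[java.lang.String])
        .invoke(hiveConf, key)).map(_.toString)
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError |
           _: NoSuchMethodException | _: InvocationTargetException =>
        None // degrade gracefully instead of aborting the YARN submission
    }
}
{code}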

> YarnClient cant get tokens to talk to Hive in a secure cluster
> --
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized Hadoop cluster fails. This appears to be because the 
> constructor of the {{org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11266) Peak memory tests swallow failures

2015-10-22 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11266:
-

 Summary: Peak memory tests swallow failures
 Key: SPARK-11266
 URL: https://issues.apache.org/jira/browse/SPARK-11266
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.5.0
Reporter: Andrew Or
Priority: Critical


You can get something like the following in the logs without the tests failing:
{code}
22:29:03.493 ERROR org.apache.spark.scheduler.LiveListenerBus: Listener 
SaveInfoListener threw an exception
org.scalatest.exceptions.TestFailedException: peak execution memory accumulator 
not set in 'aggregation with codegen'
at 
org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
at 
org.apache.spark.AccumulatorSuite$$anonfun$verifyPeakExecutionMemorySet$1$$anonfun$27.apply(AccumulatorSuite.scala:340)
at scala.Option.getOrElse(Option.scala:120)
{code}

E.g. 
https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1936/consoleFull
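
One possible way to keep such listener-side assertion failures from being 
swallowed, sketched here with hypothetical names rather than the actual 
AccumulatorSuite code: collect the failures in the listener and re-assert them 
from the test body after the job completes.

{code}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class SaveFailuresListener extends SparkListener {
  val failures = new ArrayBuffer[Throwable]()

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    try {
      // hypothetical check, e.g. that peak execution memory was recorded for the task
      require(taskEnd.taskMetrics != null, "task metrics missing")
    } catch {
      case t: Throwable => failures += t // recorded instead of being lost on the listener bus
    }
  }
}

// In the test body, after the job has run:
//   assert(listener.failures.isEmpty, listener.failures.mkString("\n"))
{code}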



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11242) In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly

2015-10-22 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11242.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9201
[https://github.com/apache/spark/pull/9201]

> In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly
> ---
>
> Key: SPARK-11242
> URL: https://issues.apache.org/jira/browse/SPARK-11242
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.1
>Reporter: Xiu
>Priority: Trivial
> Fix For: 1.6.0
>
>
> In conf/spark-env.sh.template
> https://github.com/apache/spark/blob/master/conf/spark-env.sh.template#L42
> # - SPARK_DRIVER_MEMORY, Memory for Master (e.g. 1000M, 2G) (Default: 1G)
> SPARK_DRIVER_MEMORY is the memory setting for the driver, not the master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


