[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dat Tran updated SPARK-12753:
-
Description: 
The current directory structure for my test script is as follows:
project/
   script/
  __init__.py 
  map.py
   test/
 __init.py__
 test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run nosetests in the test directory, the test fails with a "no module 
named script" error. 
However, when I modify the map_add function in map.py so that the call to add 
inside reduceByKey is replaced with a lambda, like this:

def map_add(df):
    result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x, y: x + y)
    return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test cannot find the script module when it 
is run from within the test directory. 

I have also attached the log error file. Any help will be much appreciated.

  was:
The current directory structure for my test script is as follows:
project/
  script/
 __init__.py 
 map.py
  test/
__init.py__
test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run the nosetest in the test directory, the test fails. I get no module 
named "script" found error. 
However when I modify the map_add function to replace the call to add within 
reduceByKey in map.py like this:

def map_add(df):
result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: x+y)
return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test doesn't detect the script module when 
it is within the test directory. 

I have also attached the log error file. Any help will be much appreciated.


> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>script/
>   __init__.py 
>   map.py
>test/
>  __init.py__
>  test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.






[jira] [Created] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)
Dat Tran created SPARK-12753:


 Summary: Import error during unit test while calling a function 
from reduceByKey()
 Key: SPARK-12753
 URL: https://issues.apache.org/jira/browse/SPARK-12753
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.6.0
 Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
Anaconda 
Reporter: Dat Tran
Priority: Trivial


The current directory structure for my test script is as follows:
project/
  script/
 __init__.py 
 map.py
  test/
__init.py__
test_map.py

I have attached map.py and test_map.py file with this issue. 

When I run nosetests in the test directory, the test fails with a "no module 
named script" error. 
However, when I modify the map_add function in map.py so that the call to add 
inside reduceByKey is replaced with a lambda, like this:

def map_add(df):
    result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x, y: x + y)
    return result

The test passes.

Also, when I run the original test_map.py from the project directory, the test 
passes. 

I am not able to figure out why the test cannot find the script module when it 
is run from within the test directory. 

I have also attached the log error file. Any help will be much appreciated.






[jira] [Created] (SPARK-12754) Data type mismatch on two array values when using filter/where

2016-01-11 Thread Jesse English (JIRA)
Jesse English created SPARK-12754:
-

 Summary: Data type mismatch on two array values when using 
filter/where
 Key: SPARK-12754
 URL: https://issues.apache.org/jira/browse/SPARK-12754
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 1.5.0
 Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+
Reporter: Jesse English


The following test produces the error _org.apache.spark.sql.AnalysisException: 
cannot resolve '(point = array(0,9))' due to data type mismatch: differing 
types in '(point = array(0,9))' (array and array)_

This is not the case on 1.4.x, but was introduced with 1.5+. Is there a 
preferred method for making this sort of arbitrarily sized array comparison? 
(One possible workaround is sketched after the test below.)

{code:title=test.scala}
test("test array comparison") {

val vectors: Vector[Row] =  Vector(
  Row.fromTuple("id_1" -> Array(0L, 2L)),
  Row.fromTuple("id_2" -> Array(0L, 5L)),
  Row.fromTuple("id_3" -> Array(0L, 9L)),
  Row.fromTuple("id_4" -> Array(1L, 0L)),
  Row.fromTuple("id_5" -> Array(1L, 8L)),
  Row.fromTuple("id_6" -> Array(2L, 4L)),
  Row.fromTuple("id_7" -> Array(5L, 6L)),
  Row.fromTuple("id_8" -> Array(6L, 2L)),
  Row.fromTuple("id_9" -> Array(7L, 0L))
)
val data: RDD[Row] = sc.parallelize(vectors, 3)

val schema = StructType(
  StructField("id", StringType, false) ::
StructField("point", DataTypes.createArrayType(LongType), false) ::
Nil
)

val sqlContext = new SQLContext(sc)
var dataframe = sqlContext.createDataFrame(data, schema)

val targetPoint: Array[Long] = Array(0L, 9L)

//This is the line where it fails
//org.apache.spark.sql.AnalysisException: cannot resolve 
// '(point = array(0,9))' due to data type mismatch:
// differing types in '(point = array(0,9))' 
// (array and array).

val targetRow = dataframe.where(dataframe("point") === 
array(targetPoint.map(value => lit(value)): _*)).first()

assert(targetRow != null)
  }
{code}
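
For what it's worth, one possible workaround is sketched below (only a sketch, 
not an officially preferred method; the UDF name is made up here). It pushes the 
element-wise comparison into a UDF, so the analyzer never has to resolve an 
array literal against the column:

{code:title=workaround-sketch.scala}
import org.apache.spark.sql.functions.udf

// Compare the array column against a fixed target inside a UDF instead of
// building an array literal. An ArrayType(LongType) column arrives as Seq[Long].
val targetPoint: Seq[Long] = Seq(0L, 9L)
val matchesTarget = udf { (point: Seq[Long]) => point.sameElements(targetPoint) }

val targetRow = dataframe.where(matchesTarget(dataframe("point"))).first()
assert(targetRow != null)
{code}

This keeps the comparison independent of the array length and sidesteps the 
literal type resolution that fails above.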






[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()

2016-01-11 Thread Dat Tran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dat Tran updated SPARK-12753:
-
Attachment: map.py

> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, 
> Anaconda 
>Reporter: Dat Tran
>Priority: Trivial
>  Labels: pyspark, python3, unit-test
> Attachments: map.py
>
>
> The current directory structure for my test script is as follows:
> project/
>   script/
>  __init__.py 
>  map.py
>   test/
> __init.py__
> test_map.py
> I have attached map.py and test_map.py file with this issue. 
> When I run the nosetest in the test directory, the test fails. I get no 
> module named "script" found error. 
> However when I modify the map_add function to replace the call to add within 
> reduceByKey in map.py like this:
> def map_add(df):
> result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x,y: 
> x+y)
> return result
> The test passes.
> Also, when I run the original test_map.py from the project directory, the 
> test passes. 
> I am not able to figure out why the test doesn't detect the script module 
> when it is within the test directory. 
> I have also attached the log error file. Any help will be much appreciated.






[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092427#comment-15092427
 ] 

Yin Huai commented on SPARK-12403:
--

[~lunendl] Also, have you reported this to Simba? If there is a public page that 
tracks that issue, it would be good to post a link to it here.

(BTW, from the error message it looks like the ODBC driver got the wrong database 
name. I am not sure whether it is a problem in the ODBC driver or in Spark SQL's 
thrift server. We will try to investigate when we get a chance.)

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark (Databricks - "Simba Spark ODBC Driver 1.0"). We are able to do a show 
> databases and a show tables, but not any queries, e.g.:
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"






[jira] [Commented] (SPARK-12646) Support _HOST in kerberos principal for connecting to secure cluster

2016-01-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092460#comment-15092460
 ] 

Marcelo Vanzin commented on SPARK-12646:


Can you convince people to at least use proper credentials to launch the Spark 
jobs instead of reusing YARN's?

I'm a little wary of adding this feature just to support a broken use case. 
When running on YARN, Spark is a user application, and you're asking for Spark 
to authenticate using service principals. That's kinda wrong, even if it works.

Your code also has a huge problem in that it uses {{InetAddress.getLocalHost}}; 
even if this were a desirable feature, there's no guarantee that's the correct 
host to use at all. On multi-homed machines, for example, which should be the 
address to use when expanding the principal template?

Your application can also log in to Kerberos before launching the Spark job: 
call kinit yourself and then launch Spark without "--principal" or "--keytab". 
Then Spark doesn't need to do anything; it just inherits the Kerberos ticket 
from your app.

> Support _HOST in kerberos principal for connecting to secure cluster
> 
>
> Key: SPARK-12646
> URL: https://issues.apache.org/jira/browse/SPARK-12646
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Hari Krishna Dara
>Priority: Minor
>  Labels: security
>
> Hadoop supports _HOST as a token that is dynamically replaced with the actual 
> hostname at the time the kerberos authentication is done. This is supported 
> in many hadoop stacks including YARN. When configuring Spark to connect to 
> secure cluster (e.g., yarn-cluster or yarn-client as master), it would be 
> natural to extend support for this token to Spark as well. 






[jira] [Commented] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT

2016-01-11 Thread Alan Braithwaite (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092481#comment-15092481
 ] 

Alan Braithwaite commented on SPARK-4389:
-

So is there any hope for running Spark behind a transparent proxy, then? What 
is the preferred method for running a Spark master in an environment where 
things get dynamically scheduled (Mesos+Marathon, Kubernetes, etc.)?

> Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located 
> behind NAT
> -
>
> Key: SPARK-4389
> URL: https://issues.apache.org/jira/browse/SPARK-4389
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Josh Rosen
>Priority: Minor
>
> We should set {{akka.remote.netty.tcp.bind-hostname="0.0.0.0"}} in our Akka 
> configuration so that Spark drivers can be located behind NATs / work with 
> weird DNS setups.
> This is blocked by upgrading our Akka version, since this configuration is 
> not present Akka 2.3.4.  There might be a different approach / workaround 
> that works on our current Akka version, though.
> EDIT: this is blocked by Akka 2.4, since this feature is only available in 
> the 2.4 snapshot release.






[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092425#comment-15092425
 ] 

Yin Huai commented on SPARK-12403:
--

[~lunendl] Can you try adding the database name to the FROM clause and see if 
that works around the issue (i.e., using {{Select * from openquery(SPARK,'Select 
* from yourDBName.lunentest')}})?

> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
> 
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query 
>Reporter: Lunen
>
> We are unable to query the SPARK tables using the ODBC driver from Simba 
> Spark(Databricks - "Simba Spark ODBC Driver 1.0")  We are able to do a show 
> databases and show tables, but not any queries. eg.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message 
> "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for 
> execution against OLE DB provider "MSDASQL" for linked server "SPARK"






[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.

2016-01-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092445#comment-15092445
 ] 

Jean-Baptiste Onofré commented on SPARK-12430:
--

I think it's related to this commit:

{code}
52f5754 Marcelo Vanzin on 1/21/15 at 11:38 PM (committed by Josh Rosen on 
2/2/15 at 11:01 PM)
Make sure only owner can read / write to directories created for the job.
Whenever a directory is created by the utility method, immediately restrict
its permissions so that only the owner has access to its contents.
Signed-off-by: Josh Rosen 
{code}

This can be checked with the extras/java8-test build; I will verify.

Sorry for the delay; I'll keep you posted.

> Temporary folders do not get deleted after Task completes causing problems 
> with disk space.
> ---
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1, 1.5.2
> Environment: Ubuntu server
>Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the 
> framework completes. Completing an M/R job using Spark 1.5.2 (same behavior as 
> Spark 1.5.1) over Mesos will not delete some temporary folders, causing free 
> disk space on the server to become exhausted. 
> Behavior of M/R job using Spark 1.4.1 over Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/*  , 
>  */tmp/spark-#/blockmgr-#*
> - When task is completed */tmp/spark-#/* gets deleted along with 
> */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of M/R job using Spark 1.5.2 over Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/mesos/slaves/id** * , 
> */tmp/spark-***/ *  ,{color:red} /tmp/blockmgr-***{color}
> - When task is completed */tmp/spark-***/ * gets deleted but NOT shuffle 
> container folder {color:red} /tmp/blockmgr-***{color}
> Unfortunately, {color:red} /tmp/blockmgr-***{color} can account for several 
> GB depending on the job that ran. Over time this causes disk space to become 
> full with consequences that we all know. 
> Running a shell script would probably work, but it is difficult to tell 
> folders in use by a running M/R job apart from stale folders. I did notice 
> similar issues opened by other users marked as "resolved", but none seems to 
> exactly match the behavior above. 
> I really hope someone has insights on how to fix it.
> Thank you very much!






[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Michael Allman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092489#comment-15092489
 ] 

Michael Allman commented on SPARK-12755:


I'm going to put together a PR that simply reorders the call to stop the event 
logger so that it comes before the call to stop the DAG scheduler.
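
To illustrate the idea only (the names below are simplified placeholders, not 
the actual {{SparkContext}} code), the reordering would move the event-logger 
shutdown ahead of the DAG-scheduler shutdown inside {{SparkContext.stop()}}:

{code}
// Sketch of the proposed ordering in SparkContext.stop() -- placeholder names.
// Finalize the event log first, so the ".inprogress" suffix is already gone
// before anything downstream of dagScheduler.stop() can trigger the master's
// rebuildSparkUI and read the log.
eventLogger.foreach(_.stop())
dagScheduler.stop()
{code}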

> Spark may attempt to rebuild application UI before finishing writing the 
> event logs in possible race condition
> --
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Michael Allman
>Priority: Minor
>
> As reported in SPARK-6950, it appears that sometimes the standalone master 
> attempts to build an application's historical UI before closing the app's 
> event log. This is still an issue for us in 1.5.2+, and I believe I've found 
> the underlying cause.
> When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> and then stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727
> Though it is difficult to follow the chain of events, one of the sequelae of 
> stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
> called. This method looks for the application's event logs, and its behavior 
> varies based on the existence of an {{.inprogress}} file suffix. In 
> particular, a warning is logged if this suffix exists:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935
> After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
> stops the event logger:
> https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736
> This renames the event log, dropping the {{.inprogress}} filename suffix.
> As such, a race condition exists where the master may attempt to process the 
> application log file before finalizing it.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-11 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092455#comment-15092455
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks, Nikita. I will be issuing PRs against your kafka09-integration branch 
so it can become the single source of truth until this change gets merged into 
Spark. I believe the Spark community prefers discussion on PRs once they are 
filed, so you'll hear more from me there :-)

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API 
> that is not compatible with the old one, so I added the new consumer API. I 
> made separate classes in the package org.apache.spark.streaming.kafka.v09 
> with the changed API. I did not remove the old classes, for backward 
> compatibility: users will not need to change their old Spark applications 
> when they upgrade to a new Spark version.
> Please review my changes.






[jira] [Updated] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12744:
-
Assignee: Anatoliy Plastinin

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
>
> Let's say we have the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts int to timestamp, treating the int value as a number of seconds.
> https://issues.apache.org/jira/browse/SPARK-11724
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> ++
> |  ts|
> ++
> |2016-01-10 01:37:...|
> ++
> {code}
> However, parsing the JSON with an explicit schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> ++
> |  ts|
> ++
> |1970-01-17 20:26:...|
> ++
> {code}
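
The two outputs above are consistent with the same integer being interpreted as 
seconds in one code path and as milliseconds in the other; a quick check outside 
Spark (this is only an observation about the symptom, not a reading of the JSON 
parser internals):

{code}
// 1452386229 read as seconds vs. milliseconds since the Unix epoch.
val ts = 1452386229L
new java.sql.Timestamp(ts * 1000L)  // seconds path      -> 2016-01-10 ...
new java.sql.Timestamp(ts)          // milliseconds path -> 1970-01-17 ...
{code}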






[jira] [Created] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition

2016-01-11 Thread Michael Allman (JIRA)
Michael Allman created SPARK-12755:
--

 Summary: Spark may attempt to rebuild application UI before 
finishing writing the event logs in possible race condition
 Key: SPARK-12755
 URL: https://issues.apache.org/jira/browse/SPARK-12755
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Michael Allman
Priority: Minor


As reported in SPARK-6950, it appears that sometimes the standalone master 
attempts to build an application's historical UI before closing the app's event 
log. This is still an issue for us in 1.5.2+, and I believe I've found the 
underlying cause.

When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727

and then stops the event logger:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727

Though it is difficult to follow the chain of events, one of the sequelae of 
stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is 
called. This method looks for the application's event logs, and its behavior 
varies based on the existence of an {{.inprogress}} file suffix. In particular, 
a warning is logged if this suffix exists:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935

After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} 
stops the event logger:

https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736

This renames the event log, dropping the {{.inprogress}} filename suffix.

As such, a race condition exists where the master may attempt to process the 
application log file before finalizing it.






[jira] [Resolved] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12744.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

This issue has been resolved by https://github.com/apache/spark/pull/10687.

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Let’s have following json
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark sql casts int to timestamp treating int value as a number of seconds.
> https://issues.apache.org/jira/browse/SPARK-11724
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> ++
> |  ts|
> ++
> |2016-01-10 01:37:...|
> ++
> {code}
> However parsing json with schema gives different result
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> ++
> |  ts|
> ++
> |1970-01-17 20:26:...|
> ++
> {code}






[jira] [Commented] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values

2016-01-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092441#comment-15092441
 ] 

Yin Huai commented on SPARK-12744:
--

[~antlypls] Can you add a comment to summarize the change (it will help us to 
prepare the release notes)?

> Inconsistent behavior parsing JSON with unix timestamp values
> -
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Anatoliy Plastinin
>Assignee: Anatoliy Plastinin
>Priority: Minor
>  Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
>
> Let’s have following json
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark sql casts int to timestamp treating int value as a number of seconds.
> https://issues.apache.org/jira/browse/SPARK-11724
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> ++
> |  ts|
> ++
> |2016-01-10 01:37:...|
> ++
> {code}
> However parsing json with schema gives different result
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> ++
> |  ts|
> ++
> |1970-01-17 20:26:...|
> ++
> {code}






[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12732:


Assignee: Apache Spark

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Assignee: Apache Spark
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if fitIntercept is false, then the model should still be trained.
> Currently, LinearRegression handles both cases in the same way: it doesn't 
> train the model and sets the intercept equal to the mean of y, which means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.
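
In other words, the constant-label shortcut should only apply when an intercept 
is being fitted. A minimal sketch of that decision (a hypothetical helper for 
illustration, not the actual {{LinearRegression.train}} code):

{code}
// Returns Some(intercept) when no optimization is needed (constant label and
// fitIntercept = true); returns None when the model should still be trained,
// including the constant-label, fitIntercept = false case.
def shortcutIntercept(yMean: Double, yStd: Double,
                      fitIntercept: Boolean): Option[Double] = {
  if (yStd == 0.0 && fitIntercept) Some(yMean)
  else None
}
{code}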






[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092509#comment-15092509
 ] 

Apache Spark commented on SPARK-6950:
-

User 'mallman' has created a pull request for this issue:
https://github.com/apache/spark/pull/10700

> Spark master UI believes some applications are in progress when they are 
> actually completed
> ---
>
> Key: SPARK-6950
> URL: https://issues.apache.org/jira/browse/SPARK-6950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.3.0
>Reporter: Matt Cheah
> Fix For: 1.3.1
>
>
> In Spark 1.2.x, I was able to set my spark event log directory to be a 
> different location from the default, and after the job finishes, I can replay 
> the UI by clicking on the appropriate link under "Completed Applications".
> Now, on a non-deterministic basis (but seems to happen most of the time), 
> when I click on the link under "Completed Applications", I instead get a 
> webpage that says:
> Application history not found (app-20150415052927-0014)
> Application myApp is still in progress.
> I am able to view the application's UI using the Spark history server, so 
> something regressed in the Spark master code between 1.2 and 1.3, but that 
> regression does not apply in the history server use case.






[jira] [Commented] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092559#comment-15092559
 ] 

Apache Spark commented on SPARK-12732:
--

User 'iyounus' has created a pull request for this issue:
https://github.com/apache/spark/pull/10702

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if the fit intercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way. It doesn't 
> train the model and sets the intercept equal to the mean of y. Which, means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.






[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12732:


Assignee: (was: Apache Spark)

> Fix LinearRegression.train for the case when label is constant and 
> fitIntercept=false
> -
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Imran Younus
>Priority: Minor
>
> If the target variable is constant, then the linear regression must check if 
> the fitIntercept is true or false, and handle these two cases separately.
> If the fitIntercept is true, then there is no training needed and we set the 
> intercept equal to the mean of y.
> But if the fit intercept is false, then the model should still train.
> Currently, LinearRegression handles both cases in the same way. It doesn't 
> train the model and sets the intercept equal to the mean of y. Which, means 
> that it returns a non-zero intercept even when the user forces the regression 
> through the origin.






[jira] [Created] (SPARK-12756) use hash expression in Exchange

2016-01-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12756:
---

 Summary: use hash expression in Exchange
 Key: SPARK-12756
 URL: https://issues.apache.org/jira/browse/SPARK-12756
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7831:
---

Assignee: Apache Spark

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>Assignee: Apache Spark
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.






[jira] [Commented] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092554#comment-15092554
 ] 

Apache Spark commented on SPARK-7831:
-

User 'nraychaudhuri' has created a pull request for this issue:
https://github.com/apache/spark/pull/10701

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.






[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped

2016-01-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7831:
---

Assignee: (was: Apache Spark)

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --
>
> Key: SPARK-7831
> URL: https://issues.apache.org/jira/browse/SPARK-7831
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>Reporter: Luc Bourlier
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be 
> running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher 
> registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the 
> application is correctly terminated locally, but the framework is still 
> listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is 
> stopped.






[jira] [Commented] (SPARK-12714) Transforming Dataset with sequences of case classes to RDD causes Task Not Serializable exception

2016-01-11 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092591#comment-15092591
 ] 

Michael Armbrust commented on SPARK-12714:
--

Would you be able to test with {{branch-1.6}}?  I backported a bunch of fixes 
after the release.

> Transforming Dataset with sequences of case classes to RDD causes Task Not 
> Serializable exception
> -
>
> Key: SPARK-12714
> URL: https://issues.apache.org/jira/browse/SPARK-12714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: linux 3.13.0-24-generic, scala 2.10.6
>Reporter: James Eastwood
>
> Attempting to transform a Dataset of a case class containing a nested 
> sequence of case classes causes an exception to be thrown: 
> `org.apache.spark.SparkException: Task not serializable`.
> Here is a minimal repro:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> case class Top(a: String, nested: Array[Nested])
> case class Nested(b: String)
> object scratch {
>   def main ( args: Array[String] ) {
> lazy val sparkConf = new 
> SparkConf().setAppName("scratch").setMaster("local[1]")
> lazy val sparkContext = new SparkContext(sparkConf)
> lazy val sqlContext = new SQLContext(sparkContext)
> val input = List(
>   """{ "a": "123", "nested": [{ "b": "123" }] }"""
> )
> import sqlContext.implicits._
> val ds = sqlContext.read.json(sparkContext.parallelize(input)).as[Top]
> ds.rdd.foreach(println)
> sparkContext.stop()
>   }
> }
> {code}
> {code}
> scalaVersion := "2.10.6"
> lazy val sparkVersion = "1.6.0"
> libraryDependencies ++= List(
>   "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
> )
> {code}
> Full stack trace:
> {code}
> [error] (run-main-0) org.apache.spark.SparkException: Task not serializable
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>   at org.apache.spark.sql.Dataset.rdd(Dataset.scala:166)
>   at scratch$.main(scratch.scala:26)
>   at scratch.main(scratch.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: java.io.NotSerializableException: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$, value: package 
> )
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, )
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$TypeRef$$anon$6, Nested)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  name: elementType$1, type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2,
>  )
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2$$anonfun$apply$1,
>  name: $outer, type: class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2)
>   - object (class 
> 
