[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()
[ https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dat Tran updated SPARK-12753: - Description: The current directory structure for my test script is as follows:
project/
    script/
        __init__.py
        map.py
    test/
        __init.py__
        test_map.py
I have attached the map.py and test_map.py files with this issue. When I run nosetests in the test directory, the test fails with a "no module named script" error. However, when I modify the map_add function in map.py to replace the call to add within reduceByKey with a lambda, like this:
def map_add(df):
    result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x, y: x + y)
    return result
the test passes. Also, when I run the original test_map.py from the project directory, the test passes. I am not able to figure out why the test doesn't detect the script module when it is run from within the test directory. I have also attached the error log file. Any help will be much appreciated.
> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, Anaconda
> Reporter: Dat Tran
> Priority: Trivial
> Labels: pyspark, python3, unit-test
> Attachments: log.txt, map.py, test_map.py
>
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()
Dat Tran created SPARK-12753: Summary: Import error during unit test while calling a function from reduceByKey() Key: SPARK-12753 URL: https://issues.apache.org/jira/browse/SPARK-12753 Project: Spark Issue Type: Question Components: PySpark Affects Versions: 1.6.0 Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, Anaconda Reporter: Dat Tran Priority: Trivial
The current directory structure for my test script is as follows:
project/
    script/
        __init__.py
        map.py
    test/
        __init.py__
        test_map.py
I have attached the map.py and test_map.py files with this issue. When I run nosetests in the test directory, the test fails with a "no module named script" error. However, when I modify the map_add function in map.py to replace the call to add within reduceByKey with a lambda, like this:
def map_add(df):
    result = df.map(lambda x: (x.key, x.value)).reduceByKey(lambda x, y: x + y)
    return result
the test passes. Also, when I run the original test_map.py from the project directory, the test passes. I am not able to figure out why the test doesn't detect the script module when it is run from within the test directory. I have also attached the error log file. Any help will be much appreciated.
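A likely explanation (an assumption, not confirmed in the attachments): when nosetests is launched from inside test/, only test/ lands on sys.path, so the serialized function shipped to reduceByKey cannot re-import the script package; the lambda version works because its closure no longer references that module. A hypothetical sketch of the kind of path fix one could put at the top of test_map.py — the path layout is taken from the report, the helper logic is illustrative:

```python
import os
import sys

# Hypothetical fix for test_map.py: make the project root importable so that
# "from script.map import map_add" resolves no matter which directory
# nosetests is launched from.
PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

# from script.map import map_add  # would now be resolvable on the driver
```

For the executors, the module would additionally have to be shipped (e.g. via the pyFiles argument of SparkContext), since sys.path on the driver does not propagate to worker processes.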
[jira] [Created] (SPARK-12754) Data type mismatch on two array values when using filter/where
Jesse English created SPARK-12754: - Summary: Data type mismatch on two array values when using filter/where Key: SPARK-12754 URL: https://issues.apache.org/jira/browse/SPARK-12754 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0, 1.5.0 Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.5.0+ Reporter: Jesse English
The following test produces the error _org.apache.spark.sql.AnalysisException: cannot resolve '(point = array(0,9))' due to data type mismatch: differing types in '(point = array(0,9))' (array<bigint> and array<int>)_
This was not the case on 1.4.x, but has been introduced with 1.5+. Is there a preferred method for making this sort of arbitrarily sized array comparison?
{code:title=test.scala}
test("test array comparison") {
  val vectors: Vector[Row] = Vector(
    Row.fromTuple("id_1" -> Array(0L, 2L)),
    Row.fromTuple("id_2" -> Array(0L, 5L)),
    Row.fromTuple("id_3" -> Array(0L, 9L)),
    Row.fromTuple("id_4" -> Array(1L, 0L)),
    Row.fromTuple("id_5" -> Array(1L, 8L)),
    Row.fromTuple("id_6" -> Array(2L, 4L)),
    Row.fromTuple("id_7" -> Array(5L, 6L)),
    Row.fromTuple("id_8" -> Array(6L, 2L)),
    Row.fromTuple("id_9" -> Array(7L, 0L))
  )
  val data: RDD[Row] = sc.parallelize(vectors, 3)
  val schema = StructType(
    StructField("id", StringType, false) ::
    StructField("point", DataTypes.createArrayType(LongType), false) :: Nil
  )
  val sqlContext = new SQLContext(sc)
  var dataframe = sqlContext.createDataFrame(data, schema)
  val targetPoint: Array[Long] = Array(0L, 9L)

  // This is the line where it fails:
  // org.apache.spark.sql.AnalysisException: cannot resolve
  // '(point = array(0,9))' due to data type mismatch:
  // differing types in '(point = array(0,9))'
  // (array<bigint> and array<int>).
  val targetRow = dataframe.where(dataframe("point") === array(targetPoint.map(value => lit(value)): _*)).first()
  assert(targetRow != null)
}
{code}
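The mismatch is between the column's array<bigint> and the literal array<int> built from lit(value); one workaround is to expand the comparison into per-element predicates (point(0) === 0L && point(1) === 9L for a known length). As a plain-Python illustration (not Spark) of the filter semantics the test intends over the same rows:

```python
# Plain-Python illustration (not Spark) of the filter the test intends:
# keep rows whose "point" array equals the target array element-by-element.
rows = [
    ("id_1", [0, 2]), ("id_2", [0, 5]), ("id_3", [0, 9]),
    ("id_4", [1, 0]), ("id_5", [1, 8]), ("id_6", [2, 4]),
    ("id_7", [5, 6]), ("id_8", [6, 2]), ("id_9", [7, 0]),
]
target = [0, 9]

# Equivalent of expanding the array comparison into per-element predicates,
# which sidesteps the array-vs-array type mismatch in the DSL.
matches = [rid for rid, point in rows
           if len(point) == len(target)
           and all(p == t for p, t in zip(point, target))]
# matches == ["id_3"]
```

The per-element expansion only works when the array length is fixed and known up front, which is exactly why the reporter asks for an arbitrarily sized comparison.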
[jira] [Updated] (SPARK-12753) Import error during unit test while calling a function from reduceByKey()
[ https://issues.apache.org/jira/browse/SPARK-12753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dat Tran updated SPARK-12753: - Attachment: map.py
> Import error during unit test while calling a function from reduceByKey()
> -
>
> Key: SPARK-12753
> URL: https://issues.apache.org/jira/browse/SPARK-12753
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 1.6.0
> Environment: El Capitan, Single cluster Hadoop, Python 3, Spark 1.6, Anaconda
> Reporter: Dat Tran
> Priority: Trivial
> Labels: pyspark, python3, unit-test
> Attachments: map.py
>
[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092427#comment-15092427 ] Yin Huai commented on SPARK-12403: -- [~lunendl] Also, have you reported this to Simba? If there is any public page that tracks that issue, it would be good to post it here. (btw, from the error message it looks like the ODBC driver got the wrong database name. I am not sure if it is a problem of the ODBC driver or of Spark SQL's thrift server. We will try to investigate when we get a chance.)
> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query
> Reporter: Lunen
>
> We are unable to query the Spark tables using the ODBC driver from Simba Spark (Databricks - "Simba Spark ODBC Driver 1.0"). We are able to do a show databases and a show tables, but not any queries, e.g.
> Working:
> Select * from openquery(SPARK,'SHOW DATABASES')
> Select * from openquery(SPARK,'SHOW TABLES')
> Not working:
> Select * from openquery(SPARK,'Select * from lunentest')
> The error I get is:
> OLE DB provider "MSDASQL" for linked server "SPARK" returned message "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest".
> Msg 7321, Level 16, State 2, Line 2
> An error occurred while preparing the query "Select * from lunentest" for execution against OLE DB provider "MSDASQL" for linked server "SPARK"
[jira] [Commented] (SPARK-12646) Support _HOST in kerberos principal for connecting to secure cluster
[ https://issues.apache.org/jira/browse/SPARK-12646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092460#comment-15092460 ] Marcelo Vanzin commented on SPARK-12646: Can you convince people to at least use proper credentials to launch the Spark jobs instead of reusing YARN's? I'm a little wary of adding this feature just to support a broken use case. When running on YARN, Spark is a user application, and you're asking for Spark to authenticate using service principals. That's kinda wrong, even if it works. Your code also has a huge problem in that it uses {{InetAddress.getLocalHost}}; even if this were a desirable feature, there's no guarantee that's the correct host to use at all. On multi-homed machines, for example, which should be the address to use when expanding the principal template? Your application can also log in to Kerberos before launching the Spark job: call kinit yourself and then launch Spark without "--principal" or "--keytab". Then Spark doesn't need to do anything; it just inherits the Kerberos ticket from your app.
> Support _HOST in kerberos principal for connecting to secure cluster
>
> Key: SPARK-12646
> URL: https://issues.apache.org/jira/browse/SPARK-12646
> Project: Spark
> Issue Type: Improvement
> Components: YARN
> Reporter: Hari Krishna Dara
> Priority: Minor
> Labels: security
>
> Hadoop supports _HOST as a token that is dynamically replaced with the actual hostname at the time the Kerberos authentication is done. This is supported in many Hadoop stacks including YARN. When configuring Spark to connect to a secure cluster (e.g., yarn-cluster or yarn-client as master), it would be natural to extend support for this token to Spark as well.
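For reference, the _HOST convention in Hadoop amounts to substituting a lowercased hostname into the principal template. A minimal sketch of that substitution (the helper name is illustrative; the hostname is passed in explicitly to sidestep the InetAddress.getLocalHost ambiguity criticized above):

```python
# Illustrative sketch of Hadoop-style _HOST expansion in a Kerberos
# principal template. The hostname is supplied by the caller, because on a
# multi-homed machine "the local host" is ambiguous.
def expand_principal(template: str, hostname: str) -> str:
    return template.replace("_HOST", hostname.lower())

expanded = expand_principal("spark/_HOST@EXAMPLE.COM", "Node1.Example.com")
# expanded == "spark/node1.example.com@EXAMPLE.COM"
```

The whole debate above is about which hostname belongs in that second argument when the process launching Spark cannot know it reliably.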
[jira] [Commented] (SPARK-4389) Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT
[ https://issues.apache.org/jira/browse/SPARK-4389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092481#comment-15092481 ] Alan Braithwaite commented on SPARK-4389: - So is there any hope for running Spark behind a transparent proxy then? What is the preferred method for running a spark-master in an environment where things get dynamically scheduled (mesos+marathon, kubernetes, etc)?
> Set akka.remote.netty.tcp.bind-hostname="0.0.0.0" so driver can be located behind NAT
>
> Key: SPARK-4389
> URL: https://issues.apache.org/jira/browse/SPARK-4389
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Josh Rosen
> Priority: Minor
>
> We should set {{akka.remote.netty.tcp.bind-hostname="0.0.0.0"}} in our Akka configuration so that Spark drivers can be located behind NATs / work with weird DNS setups.
> This is blocked by upgrading our Akka version, since this configuration is not present in Akka 2.3.4. There might be a different approach / workaround that works on our current Akka version, though.
> EDIT: this is blocked by Akka 2.4, since this feature is only available in the 2.4 snapshot release.
[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092425#comment-15092425 ] Yin Huai commented on SPARK-12403: -- [~lunendl] Can you try to add the db name to the from clause and see if you can work around the issue (using {{Select * from openquery(SPARK,'Select * from yourDBName.lunentest')}})?
> "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
>
> Key: SPARK-12403
> URL: https://issues.apache.org/jira/browse/SPARK-12403
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: ODBC connector query
> Reporter: Lunen
>
[jira] [Commented] (SPARK-12430) Temporary folders do not get deleted after Task completes causing problems with disk space.
[ https://issues.apache.org/jira/browse/SPARK-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092445#comment-15092445 ] Jean-Baptiste Onofré commented on SPARK-12430: -- I think it's related to this commit:
{code}
52f5754 Marcelo Vanzin on 1/21/15 at 11:38 PM (committed by Josh Rosen on 2/2/15 at 11:01 PM)
Make sure only owner can read / write to directories created for the job.
Whenever a directory is created by the utility method, immediately restrict its permissions so that only the owner has access to its contents.
Signed-off-by: Josh Rosen
{code}
As it can be checked with the extras/java8-tests, I will verify. Sorry for the delay; I'll keep you posted.
> Temporary folders do not get deleted after Task completes causing problems with disk space.
>
> Key: SPARK-12430
> URL: https://issues.apache.org/jira/browse/SPARK-12430
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.1, 1.5.2
> Environment: Ubuntu server
> Reporter: Fede Bar
>
> We are experiencing an issue with automatic /tmp folder deletion after the framework completes. Completing a M/R job using Spark 1.5.2 (same behavior as Spark 1.5.1) over Mesos will not delete some temporary folders, causing free disk space on the server to exhaust.
> Behavior of a M/R job using Spark 1.4.1 over a Mesos cluster:
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/slaves/id#* , */tmp/spark-#/* , */tmp/spark-#/blockmgr-#*
> - When the task is completed, */tmp/spark-#/* gets deleted along with the */tmp/spark-#/blockmgr-#* sub-folder.
> Behavior of a M/R job using Spark 1.5.2 over a Mesos cluster (same identical job):
> - Launched using spark-submit on one cluster node.
> - The following folders are created: */tmp/mesos/mesos/slaves/id** * , */tmp/spark-***/ * , {color:red}/tmp/blockmgr-***{color}
> - When the task is completed, */tmp/spark-***/ * gets deleted but NOT the shuffle container folder {color:red}/tmp/blockmgr-***{color}
> Unfortunately, {color:red}/tmp/blockmgr-***{color} can account for several GB depending on the job that ran. Over time this causes disk space to become full, with consequences that we all know.
> Running a shell script would probably work, but it is difficult to distinguish folders in use by a running M/R job from stale folders. I did notice similar issues opened by other users marked as "resolved", but none seems to exactly match the above behavior.
> I really hope someone has insights on how to fix it. Thank you very much!
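As a stopgap while the leak exists, stale shuffle directories can at least be located by age before an operator decides what to delete. A cautious, hypothetical sketch (the helper name and one-day threshold are assumptions, and it only reports candidates, since a directory may still belong to a running job):

```python
import pathlib
import time

# Hypothetical helper: list blockmgr-* directories under "root" whose mtime
# is older than max_age_days. It deliberately does NOT delete anything:
# a directory may still be in use by a running job, so deletion is left to
# the operator after cross-checking against active applications.
def stale_blockmgr_dirs(root, max_age_days=1.0, now=None):
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in pathlib.Path(root).glob("blockmgr-*")
                  if p.is_dir() and p.stat().st_mtime < cutoff)
```

Run against /tmp on an affected node, this would surface the accumulating {color:red}/tmp/blockmgr-***{color} folders the reporter describes.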
[jira] [Commented] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
[ https://issues.apache.org/jira/browse/SPARK-12755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092489#comment-15092489 ] Michael Allman commented on SPARK-12755: I'm going to put together a PR that simply reorders the call to stop the event logger so that it comes before the call to stop the DAG scheduler.
> Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
>
> Key: SPARK-12755
> URL: https://issues.apache.org/jira/browse/SPARK-12755
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.5.2
> Reporter: Michael Allman
> Priority: Minor
>
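The proposed reordering amounts to finalizing (renaming) the event log before any step that can trigger the master's UI rebuild runs. A plain-Python sketch of that ordering, with illustrative file names (not taken from Spark's code):

```python
import os
import tempfile

# Illustrative ordering fix: drop the .inprogress suffix *before* the step
# that can trigger the master's rebuildSparkUI, so a reader never observes a
# half-written, in-progress log.
workdir = tempfile.mkdtemp()
inprogress = os.path.join(workdir, "app-20160111-0001.inprogress")
with open(inprogress, "w") as f:
    f.write("event data")

# Step 1: stop the event logger -> finalize (rename) the log file.
final = inprogress[: -len(".inprogress")]
os.rename(inprogress, final)

# Step 2: only now stop the DAG scheduler, whose shutdown path may cause the
# master to go looking for the application's (now finalized) event log.
assert os.path.exists(final) and not os.path.exists(inprogress)
```

With the original order reversed, a reader scanning between the two steps could see the .inprogress file and conclude the application is still running, which is exactly the race described above.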
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092455#comment-15092455 ] Mark Grover commented on SPARK-12177: - Thanks Nikita. And, I will be issuing PRs against your kafka09-integration branch so it can remain the single source of truth until this change gets merged into Spark. And, I believe the Spark community prefers discussion on PRs once they are filed, so you'll hear more from me there :-)
> Update KafkaDStreams to new Kafka 0.9 Consumer API
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.6.0
> Reporter: Nikita Tarasenko
> Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is not compatible with the old one. So, I added the new consumer API in separate classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I didn't remove the old classes, for backward compatibility: users will not need to change their old Spark applications when they upgrade to the new Spark version. Please review my changes.
[jira] [Updated] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values
[ https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-12744: - Assignee: Anatoliy Plastinin
> Inconsistent behavior parsing JSON with unix timestamp values
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Anatoliy Plastinin
> Assignee: Anatoliy Plastinin
> Priority: Minor
> Labels: release_notes, releasenotes
>
> Let's have the following JSON:
> {code}
> val rdd = sc.parallelize("""{"ts":1452386229}""" :: Nil)
> {code}
> Spark SQL casts int to timestamp, treating the int value as a number of seconds (https://issues.apache.org/jira/browse/SPARK-11724):
> {code}
> scala> sqlContext.read.json(rdd).select($"ts".cast(TimestampType)).show
> +--------------------+
> | ts|
> +--------------------+
> |2016-01-10 01:37:...|
> +--------------------+
> {code}
> However, parsing the JSON with a schema gives a different result:
> {code}
> scala> val schema = (new StructType).add("ts", TimestampType)
> schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
> scala> sqlContext.read.schema(schema).json(rdd).show
> +--------------------+
> | ts|
> +--------------------+
> |1970-01-17 20:26:...|
> +--------------------+
> {code}
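The two results differ by a unit interpretation: the cast path treats the integer as seconds since the epoch, while the schema path here behaves as if the value were milliseconds (1970-01-17 is what you get from 1452386229 ms). A plain-Python illustration of the two readings of the same value:

```python
from datetime import datetime, timezone

ts = 1452386229

# Interpreted as seconds since the epoch (what the cast does):
as_seconds = datetime.fromtimestamp(ts, tz=timezone.utc)
# date part: 2016-01-10

# Interpreted as milliseconds since the epoch (matching the schema path's
# 1970-01-17 output above):
as_millis = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
# date part: 1970-01-17
```

Whichever unit the schema path is meant to use, the inconsistency between the two paths is the bug being tracked here.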
[jira] [Created] (SPARK-12755) Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition
Michael Allman created SPARK-12755: -- Summary: Spark may attempt to rebuild application UI before finishing writing the event logs in possible race condition Key: SPARK-12755 URL: https://issues.apache.org/jira/browse/SPARK-12755 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Reporter: Michael Allman Priority: Minor
As reported in SPARK-6950, it appears that sometimes the standalone master attempts to build an application's historical UI before closing the app's event log. This is still an issue for us in 1.5.2+, and I believe I've found the underlying cause. When stopping a {{SparkContext}}, the {{stop}} method stops the DAG scheduler: https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1722-L1727 Though it is difficult to follow the chain of events, one of the sequelae of stopping the DAG scheduler is that the master's {{rebuildSparkUI}} method is called. This method looks for the application's event logs, and its behavior varies based on the existence of an {{.inprogress}} file suffix. In particular, a warning is logged if this suffix exists: https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L935 After calling the {{stop}} method on the DAG scheduler, the {{SparkContext}} stops the event logger: https://github.com/apache/spark/blob/a76cf51ed91d99c88f301ec85f3cda1288bcf346/core/src/main/scala/org/apache/spark/SparkContext.scala#L1734-L1736 This renames the event log, dropping the {{.inprogress}} file suffix. As such, a race condition exists where the master may attempt to process the application log file before it is finalized.
[jira] [Resolved] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values
[ https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-12744. -- Resolution: Fixed Fix Version/s: 2.0.0 This issue has been resolved by https://github.com/apache/spark/pull/10687.
> Inconsistent behavior parsing JSON with unix timestamp values
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Anatoliy Plastinin
> Assignee: Anatoliy Plastinin
> Priority: Minor
> Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
[jira] [Commented] (SPARK-12744) Inconsistent behavior parsing JSON with unix timestamp values
[ https://issues.apache.org/jira/browse/SPARK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092441#comment-15092441 ] Yin Huai commented on SPARK-12744: -- [~antlypls] Can you add a comment to summarize the change (it will help us prepare the release notes)?
> Inconsistent behavior parsing JSON with unix timestamp values
>
> Key: SPARK-12744
> URL: https://issues.apache.org/jira/browse/SPARK-12744
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Anatoliy Plastinin
> Assignee: Anatoliy Plastinin
> Priority: Minor
> Labels: release_notes, releasenotes
> Fix For: 2.0.0
>
[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false
[ https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12732: Assignee: Apache Spark
> Fix LinearRegression.train for the case when label is constant and fitIntercept=false
>
> Key: SPARK-12732
> URL: https://issues.apache.org/jira/browse/SPARK-12732
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Reporter: Imran Younus
> Assignee: Apache Spark
> Priority: Minor
>
> If the target variable is constant, then the linear regression must check whether fitIntercept is true or false, and handle these two cases separately. If fitIntercept is true, then there is no training needed and we set the intercept equal to the mean of y. But if fitIntercept is false, then the model should still train. Currently, LinearRegression handles both cases in the same way: it doesn't train the model and sets the intercept equal to the mean of y. This means that it returns a non-zero intercept even when the user forces the regression through the origin.
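The point is easy to see with ordinary least squares through the origin: even for a constant label, the origin-constrained slope is generally non-zero, so skipping training is wrong in that case. A small numeric sketch (plain Python, single feature; the numbers are illustrative, not from the issue):

```python
# Least squares through the origin for one feature: slope = sum(x*y)/sum(x*x).
# With a constant label y = 5 the best origin-constrained fit still has a
# non-zero slope, so the model must actually train when fitIntercept=false.
x = [1.0, 2.0, 3.0]
y = [5.0, 5.0, 5.0]

slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
# slope == 30.0 / 14.0, clearly non-zero; the intercept stays fixed at 0.
```

Setting the intercept to mean(y) and the slope to zero, as the current code does, is only the correct answer in the fitIntercept=true case.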
[jira] [Commented] (SPARK-6950) Spark master UI believes some applications are in progress when they are actually completed
[ https://issues.apache.org/jira/browse/SPARK-6950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092509#comment-15092509 ] Apache Spark commented on SPARK-6950: - User 'mallman' has created a pull request for this issue: https://github.com/apache/spark/pull/10700
> Spark master UI believes some applications are in progress when they are actually completed
>
> Key: SPARK-6950
> URL: https://issues.apache.org/jira/browse/SPARK-6950
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.3.0
> Reporter: Matt Cheah
> Fix For: 1.3.1
>
> In Spark 1.2.x, I was able to set my Spark event log directory to a different location from the default, and after the job finishes I can replay the UI by clicking on the appropriate link under "Completed Applications". Now, on a non-deterministic basis (but it seems to happen most of the time), when I click on the link under "Completed Applications" I instead get a webpage that says:
> Application history not found (app-20150415052927-0014)
> Application myApp is still in progress.
> I am able to view the application's UI using the Spark history server, so something regressed in the Spark master code between 1.2 and 1.3, but that regression does not apply in the history server use case.
[jira] [Commented] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false
[ https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092559#comment-15092559 ]

Apache Spark commented on SPARK-12732:
--------------------------------------

User 'iyounus' has created a pull request for this issue:
https://github.com/apache/spark/pull/10702

> Fix LinearRegression.train for the case when label is constant and fitIntercept=false
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-12732
>                 URL: https://issues.apache.org/jira/browse/SPARK-12732
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: Imran Younus
>            Priority: Minor
[jira] [Assigned] (SPARK-12732) Fix LinearRegression.train for the case when label is constant and fitIntercept=false
[ https://issues.apache.org/jira/browse/SPARK-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12732:
------------------------------------

    Assignee: (was: Apache Spark)

> Fix LinearRegression.train for the case when label is constant and fitIntercept=false
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-12732
>                 URL: https://issues.apache.org/jira/browse/SPARK-12732
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>            Reporter: Imran Younus
>            Priority: Minor
[jira] [Created] (SPARK-12756) use hash expression in Exchange
Wenchen Fan created SPARK-12756:
-----------------------------------

             Summary: use hash expression in Exchange
                 Key: SPARK-12756
                 URL: https://issues.apache.org/jira/browse/SPARK-12756
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Wenchen Fan
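For context on what a hash expression in Exchange is for: a shuffle exchange routes each row to a target partition by hashing its partitioning key, and expressing that hash as a per-row SQL expression lets the same expression-evaluation machinery compute it. A minimal sketch of hash partitioning in plain Python (illustrative only, not Spark's internal hash; crc32 stands in for whatever hash function the engine uses):

```python
# Illustrative sketch of hash partitioning (not Spark internals):
# a row's target partition is a stable hash of its key, modulo the
# number of partitions, so identical keys always co-locate.
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # crc32 is deterministic across processes, unlike Python's built-in
    # hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

keys = ["apple", "banana", "cherry", "apple"]
parts = [partition_for(k, 4) for k in keys]

# The two "apple" rows land in the same partition.
assert parts[0] == parts[3]
print(parts)
```

The determinism requirement is the design point: every task in the job must agree on where a given key goes, so the hash must be a pure function of the key.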
[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
[ https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7831:
-----------------------------------

    Assignee: Apache Spark

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --------------------------------------------------------------------------
>
>                 Key: SPARK-7831
>                 URL: https://issues.apache.org/jira/browse/SPARK-7831
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>            Reporter: Luc Bourlier
>            Assignee: Apache Spark
>
> To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be running.
> It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher registers as a framework in the Mesos cluster.
> After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the application is correctly terminated locally, but the framework is still listed as {{active}} in the Mesos dashboard.
> I would expect the framework to be de-registered when the dispatcher is stopped.
[jira] [Commented] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
[ https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092554#comment-15092554 ]

Apache Spark commented on SPARK-7831:
-------------------------------------

User 'nraychaudhuri' has created a pull request for this issue:
https://github.com/apache/spark/pull/10701

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --------------------------------------------------------------------------
>
>                 Key: SPARK-7831
>                 URL: https://issues.apache.org/jira/browse/SPARK-7831
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>            Reporter: Luc Bourlier
[jira] [Assigned] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
[ https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-7831:
-----------------------------------

    Assignee: (was: Apache Spark)

> Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
> --------------------------------------------------------------------------
>
>                 Key: SPARK-7831
>                 URL: https://issues.apache.org/jira/browse/SPARK-7831
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos
>    Affects Versions: 1.4.0
>         Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source)
>            Reporter: Luc Bourlier
[jira] [Commented] (SPARK-12714) Transforming Dataset with sequences of case classes to RDD causes Task Not Serializable exception
[ https://issues.apache.org/jira/browse/SPARK-12714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092591#comment-15092591 ]

Michael Armbrust commented on SPARK-12714:
------------------------------------------

Would you be able to test with {{branch-1.6}}? I backported a bunch of fixes after the release.

> Transforming Dataset with sequences of case classes to RDD causes Task Not Serializable exception
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-12714
>                 URL: https://issues.apache.org/jira/browse/SPARK-12714
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>         Environment: linux 3.13.0-24-generic, scala 2.10.6
>            Reporter: James Eastwood
>
> Attempting to transform a Dataset of a case class containing a nested sequence of case classes causes an exception to be thrown: {{org.apache.spark.SparkException: Task not serializable}}.
> Here is a minimum repro:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
>
> case class Top(a: String, nested: Array[Nested])
> case class Nested(b: String)
>
> object scratch {
>   def main(args: Array[String]) {
>     lazy val sparkConf = new SparkConf().setAppName("scratch").setMaster("local[1]")
>     lazy val sparkContext = new SparkContext(sparkConf)
>     lazy val sqlContext = new SQLContext(sparkContext)
>
>     val input = List(
>       """{ "a": "123", "nested": [{ "b": "123" }] }"""
>     )
>
>     import sqlContext.implicits._
>     val ds = sqlContext.read.json(sparkContext.parallelize(input)).as[Top]
>     ds.rdd.foreach(println)
>
>     sparkContext.stop()
>   }
> }
> {code}
> {code}
> scalaVersion := "2.10.6"
>
> lazy val sparkVersion = "1.6.0"
>
> libraryDependencies ++= List(
>   "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
>   "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
> )
> {code}
> Full stack trace:
> {code}
> [error] (run-main-0) org.apache.spark.SparkException: Task not serializable
> org.apache.spark.SparkException: Task not serializable
>     at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
>     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
>     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
>     at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:707)
>     at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:706)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>     at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>     at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:706)
>     at org.apache.spark.sql.Dataset.rdd(Dataset.scala:166)
>     at scratch$.main(scratch.scala:26)
>     at scratch.main(scratch.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
> Caused by: java.io.NotSerializableException: scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$
> Serialization stack:
>     - object not serializable (class: scala.reflect.internal.Mirrors$Roots$EmptyPackageClass$, value: package )
>     - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: class scala.reflect.internal.Symbols$Symbol)
>     - object (class scala.reflect.internal.Types$UniqueThisType, )
>     - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: class scala.reflect.internal.Types$Type)
>     - object (class scala.reflect.internal.Types$TypeRef$$anon$6, Nested)
>     - field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2, name: elementType$1, type: class scala.reflect.api.Types$TypeApi)
>     - object (class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2, )
>     - field (class: org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2$$anonfun$apply$1, name: $outer, type: class org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$constructorFor$2)
>     - object (class