[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
[ https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nan Zhu updated SPARK-5268: --- Summary: CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent (was: ExecutorBackend exits for irrelevant DisassociatedEvent) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent Key: SPARK-5268 URL: https://issues.apache.org/jira/browse/SPARK-5268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Nan Zhu In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the executor backend actor and exit the program upon receiving such an event... let's consider the following case: the user may develop an Akka-based program which starts an actor with Spark's actor system and communicates with an external actor system (e.g. an Akka-based receiver in Spark Streaming which communicates with an external system). If the external actor system fails or disassociates from the actor within Spark's system on purpose, we may receive a DisassociatedEvent and the executor is restarted. This is not the expected behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
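For illustration, a minimal Scala sketch of the kind of guard being proposed, assuming the executor backend knows the driver's Akka address; the class and field names here are hypothetical, and this is not the actual patch in the linked pull request:

{code}
import akka.actor.{Actor, Address}
import akka.remote.DisassociatedEvent

// Sketch: only exit when the disassociated remote address is the driver's,
// instead of exiting on every DisassociatedEvent seen by the actor system.
class ExecutorBackendSketch(driverAddress: Address) extends Actor {
  override def preStart(): Unit = {
    context.system.eventStream.subscribe(self, classOf[DisassociatedEvent])
  }

  def receive = {
    case DisassociatedEvent(_, remoteAddress, _) =>
      if (remoteAddress == driverAddress) {
        // The driver really went away: shut this executor down.
        System.exit(1)
      } else {
        // Disassociation from an unrelated actor system (e.g. a custom
        // Akka-based receiver): log it and keep running.
        println(s"Ignoring disassociation from non-driver address $remoteAddress")
      }
  }
}
{code}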
[jira] [Commented] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent
[ https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278786#comment-14278786 ] Apache Spark commented on SPARK-5268: - User 'CodingCat' has created a pull request for this issue: https://github.com/apache/spark/pull/4063 ExecutorBackend exits for irrelevant DisassociatedEvent --- Key: SPARK-5268 URL: https://issues.apache.org/jira/browse/SPARK-5268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Nan Zhu In CoarseGrainedExecutorBackend, we subscribe DisassociatedEvent in executor backend actor and exit the program upon receive such event... let's consider the following case The user may develop an Akka-based program which starts the actor with Spark's actor system and communicate with an external actor system (e.g. an Akka-based receiver in spark streaming which communicates with an external system) If the external actor system fails or disassociates with the actor within spark's system with purpose, we may receive DisassociatedEvent and the executor is restarted. This is not the expected behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278947#comment-14278947 ] Travis Galoppo commented on SPARK-5012: --- This will probably be affected by SPARK-5019 Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path
[ https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279040#comment-14279040 ] Marcelo Vanzin commented on SPARK-5185: --- BTW I talked to Uri offline about this. The cause is that {{sc._jvm.blah}} seems to use the system class loader to load blah, and {{--jars}} adds things to the application class loader instantiated by SparkSubmit. e.g., this works: {code} sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.cloudera.science.throwaway.ThrowAway").newInstance() {code} That being said, I'm not sure what the expectation is here. {{_jvm}}, starting with an underscore, gives me the impression that it's not really supposed to be a public API. pyspark --jars does not add classes to driver class path Key: SPARK-5185 URL: https://issues.apache.org/jira/browse/SPARK-5185 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Uri Laserson I have some random class I want access to from a Spark shell, say {{com.cloudera.science.throwaway.ThrowAway}}. You can find the specific example I used here: https://gist.github.com/laserson/e9e3bd265e1c7a896652 I packaged it as {{throwaway.jar}}. If I then run {{bin/spark-shell}} like so: {code} bin/spark-shell --master local[1] --jars throwaway.jar {code} I can execute {code} val a = new com.cloudera.science.throwaway.ThrowAway() {code} successfully. I now run PySpark like so: {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar {code} which gives me an error when I try to instantiate the class through Py4J: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() --- Py4JError Traceback (most recent call last) <ipython-input-1-4eedbe023c29> in <module>() ----> 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway() /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __getattr__(self, name) 724 def __getattr__(self, name): 725 if name == '__call__': --> 726 raise Py4JError('Trying to call a package.') 727 new_fqn = self._fqn + '.' + name 728 command = REFLECTION_COMMAND_NAME +\ Py4JError: Trying to call a package. {code} However, if I explicitly add {{--driver-class-path}} with the same jar {code} PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars throwaway.jar --driver-class-path throwaway.jar {code} it works: {code} In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway() Out[1]: JavaObject id=o18 {code} However, the docs state that {{--jars}} should also set the driver class path. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279078#comment-14279078 ] Reynold Xin commented on SPARK-5097: [~hkothari] that is correct. It will be trivially doable to select columns at runtime. For the 2nd one, not yet. That's a very good point. You can always do an extra projection. We will try to add it, if not in the 1st iteration, then in the 2nd iteration. Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278983#comment-14278983 ] Al M commented on SPARK-5270: - I just noticed that rdd.partitions.size is set to 0 for empty RDDs and greater than 0 for RDDs with data; this is a far more elegant check than the others. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278993#comment-14278993 ] Sean Owen commented on SPARK-5270: -- I think it's conceivable to have an RDD with no elements but nonzero partitions though. Witness: {code} val empty = sc.parallelize(Array[Int]()) empty.count ... 0 empty.partitions.size ... 8 {code} Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
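For illustration, one way an isEmpty helper could avoid both pitfalls above (a full count() and the unreliable partition count) is to look at only the first element; this is a hedged sketch, not necessarily how SPARK-5270 was ultimately implemented:

{code}
import org.apache.spark.rdd.RDD

// Scans partitions only until one element is found, so it is far cheaper than
// count() on a large non-empty RDD and still works when an empty RDD has a
// nonzero number of partitions (as in the example above).
def isEmpty[T](rdd: RDD[T]): Boolean = rdd.take(1).isEmpty
{code}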
[jira] [Created] (SPARK-5270) Elegantly check if RDD is empty
Al M created SPARK-5270: --- Summary: Elegantly check if RDD is empty Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 This is especially a problem when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can also run first() and catch the exception; this is neither a clean nor fast solution. I'd like a method rdd.isEmpty that returns a boolean. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from a configured endpoints
[ https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Brewin updated SPARK-5267: Description: The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which supports many additional input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate Thread, consuming each http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? was: The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which support many more input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate Thread, consuming each http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? Add a streaming module to ingest Apache Camel Messages from a configured endpoints -- Key: SPARK-5267 URL: https://issues.apache.org/jira/browse/SPARK-5267 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Steve Brewin Labels: features Original Estimate: 120h Remaining Estimate: 120h The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which supports many additional input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate Thread, consuming each http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5271) PySpark History Web UI issues
[ https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrey Zimovnov updated SPARK-5271: --- Component/s: Web UI PySpark History Web UI issues - Key: SPARK-5271 URL: https://issues.apache.org/jira/browse/SPARK-5271 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.2.0 Environment: PySpark 1.2.0 in yarn-client mode Reporter: Andrey Zimovnov After successful run of PySpark app via spark-submit in yarn-client mode on Hadoop 2.4 cluster the History UI shows the same as in issue SPARK-3898. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5271) PySpark History Web UI issues
Andrey Zimovnov created SPARK-5271: -- Summary: PySpark History Web UI issues Key: SPARK-5271 URL: https://issues.apache.org/jira/browse/SPARK-5271 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Environment: PySpark 1.2.0 in yarn-client mode Reporter: Andrey Zimovnov After successful run of PySpark app via spark-submit in yarn-client mode on Hadoop 2.4 cluster the History UI shows the same as in issue SPARK-3898. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty
[ https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Al M updated SPARK-5270: Description: Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. was: Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 This is especially a problem when using streams. Sometimes my batches are huge in one stream, sometimes i get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can also run first() and catch the exception; this is neither a clean nor fast solution. I'd like a method rdd.isEmpty that returns a boolean. Elegantly check if RDD is empty --- Key: SPARK-5270 URL: https://issues.apache.org/jira/browse/SPARK-5270 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial Right now there is no clean way to check if an RDD is empty. As discussed here: http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679 I'd like a method rdd.isEmpty that returns a boolean. This would be especially useful when using streams. Sometimes my batches are huge in one stream, sometimes I get nothing for hours. Still I have to run count() to check if there is anything in the RDD. I can process my empty RDD like the others but it would be more efficient to just skip the empty ones. I can also run first() and catch the exception; this is neither a clean nor fast solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
[ https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278780#comment-14278780 ] Vladimir Grigor commented on SPARK-5246: https://github.com/mesos/spark-ec2/pull/91 spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve -- Key: SPARK-5246 URL: https://issues.apache.org/jira/browse/SPARK-5246 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor How to reproduce: 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to setup VPC for this bug. After you followed that guide, start new instance in VPC, ssh to it (though NAT server) 2) user starts a cluster in VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out failed to launch org.apache.spark.deploy.master.Master: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker: 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469) 10.1.1.62:... 
12 more 10.1.1.62: full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out [timing] spark-standalone setup: 00h 00m 28s (omitted for brevity) {code} /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 8080 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, HUP, INT] Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: ip-10-1-1-151: Name or service not known at java.net.InetAddress.getLocalHost(InetAddress.java:1473) at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620) at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613) at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.util.Utils$.localHostName(Utils.scala:665) at org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27) at org.apache.spark.deploy.master.Master$.main(Master.scala:819) at org.apache.spark.deploy.master.Master.main(Master.scala) Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not known at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more {code} Problem is that instance launched in VPC may be not able to resolve own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092. I am going to submit a fix for this problem since I need this functionality asap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To
[jira] [Created] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent
Nan Zhu created SPARK-5268: -- Summary: ExecutorBackend exits for irrelevant DisassociatedEvent Key: SPARK-5268 URL: https://issues.apache.org/jira/browse/SPARK-5268 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Nan Zhu In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the executor backend actor and exit the program upon receiving such an event... let's consider the following case: the user may develop an Akka-based program which starts an actor with Spark's actor system and communicates with an external actor system (e.g. an Akka-based receiver in Spark Streaming which communicates with an external system). If the external actor system fails or disassociates from the actor within Spark's system on purpose, we may receive a DisassociatedEvent and the executor is restarted. This is not the expected behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from a configured endpoints
Steve Brewin created SPARK-5267: --- Summary: Add a streaming module to ingest Apache Camel Messages from a configured endpoints Key: SPARK-5267 URL: https://issues.apache.org/jira/browse/SPARK-5267 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.2.0 Reporter: Steve Brewin The number of input stream protocols supported by Spark Streaming is quite limited, which constrains the number of systems with which it can be integrated. This proposal solves the problem by adding an optional module that integrates Apache Camel, which support many more input protocols. Our tried and tested implementation of this proposal is spark-streaming-camel. An Apache Camel service is run on a separate Thread, consuming each http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html and storing it into Spark's memory. The provider of the Message is specified by any consuming component URI documented at http://camel.apache.org/components.html, making all of these protocols available to Spark Streaming. Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance
Ivan Vergiliev created SPARK-5269: - Summary: BlockManager.dataDeserialize always creates a new serializer instance Key: SPARK-5269 URL: https://issues.apache.org/jira/browse/SPARK-5269 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Ivan Vergiliev BlockManager.dataDeserialize always creates a new instance of the serializer, which is pretty slow in some cases. I'm using Kryo serialization and have a custom registrator, and its register method is showing up as taking about 15% of the execution time in my profiles. This started happening after I increased the number of keys in a job with a shuffle phase by a factor of 40. One solution I can think of is to create a ThreadLocal SerializerInstance for the defaultSerializer, and only create a new one if a custom serializer is passed in. AFAICT a custom serializer is passed only from DiskStore.getValues, and that, on the other hand, depends on the serializer passed to ExternalSorter. I don't know how often this is used, but I think this can still be a good solution for the standard use case. Oh, and also - ExternalSorter already has a SerializerInstance, so if the getValues method is called from a single thread, maybe we can pass that directly? I'd be happy to try a patch but would probably need a confirmation from someone that this approach would indeed work (or an idea for another). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
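As a rough illustration of the ThreadLocal idea described above (names are illustrative and this is not the actual BlockManager code):

{code}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

class CachedSerializerInstances(defaultSerializer: Serializer) {
  // One SerializerInstance per thread, created lazily on first use, so a
  // potentially expensive Kryo registrator runs once per thread rather than
  // once per deserialized block.
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
  }

  // Reuse the thread-local instance for the default serializer; only pay for
  // newInstance() when a custom serializer is explicitly passed in.
  def instanceFor(serializer: Serializer): SerializerInstance =
    if (serializer eq defaultSerializer) cached.get() else serializer.newInstance()
}
{code}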
[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD
[ https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278819#comment-14278819 ] Hamel Ajay Kothari commented on SPARK-5097: --- Am I correct in interpreting that this would allow us to trivially select columns at runtime since we'd just use {{SchemaRDD(stringColumnName)}}? In the world of catalyst selecting columns known only at runtime was a real pain because the only defined way to do it in the docs was to use quasiquotes or use {{SchemaRDD.baseLogicalPlan.resolve()}}. The first couldn't be defined at runtime (as far as I know) and the second required you to depend on expressions. Also, is there any way to control the name of the resulting columns from groupby+aggregate (or similar methods that add columns) in this plan? Adding data frame APIs to SchemaRDD --- Key: SPARK-5097 URL: https://issues.apache.org/jira/browse/SPARK-5097 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf SchemaRDD, through its DSL, already provides common data frame functionalities. However, the DSL was originally created for constructing test cases without much end-user usability and API stability consideration. This design doc proposes a set of API changes for Scala and Python to make the SchemaRDD DSL API more usable and stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279170#comment-14279170 ] Muhammad-Ali A'rabi edited comment on SPARK-5226 at 1/15/15 7:33 PM: - This is the DBSCAN algorithm: {noformat} DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited NeighborPts = regionQuery(P, eps) if sizeof(NeighborPts) < MinPts mark P as NOISE else C = next cluster expandCluster(P, NeighborPts, C, eps, MinPts) expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >= MinPts NeighborPts = NeighborPts joined with NeighborPts' if P' is not yet member of any cluster add P' to cluster C regionQuery(P, eps) return all points within P's eps-neighborhood (including P) {noformat} As you can see, there are just two parameters. There are two ways to implement it. The first one is faster (O(n log n)) and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short on memory. There are two phases of running: * Preprocessing. In this phase a distance matrix for all points is created and the distances between every two points are calculated. Very parallel. * Main Process. In this phase the algorithm runs as described in the pseudo-code, and the two foreach loops are parallelized. Region queries are done very fast (O(1)) because of the preprocessing. was (Author: angellandros): This is DBSCAN algorithm: {noformat} DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited NeighborPts = regionQuery(P, eps) if sizeof(NeighborPts) < MinPts mark P as NOISE else C = next cluster expandCluster(P, NeighborPts, C, eps, MinPts) expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >= MinPts NeighborPts = NeighborPts joined with NeighborPts' if P' is not yet member of any cluster add P' to cluster C regionQuery(P, eps) return all points within P's eps-neighborhood (including P) {noformat} As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory. There are two phases of running: * Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel. * Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. The first candidate, I think, is DBSCAN. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
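As a toy illustration of the preprocessing phase only (not a proposed MLlib API), the pairwise distance matrix could be built with a cartesian product, after which regionQuery(P, eps) reduces to a filter over P's row:

{code}
import org.apache.spark.rdd.RDD

// Assumes points are indexed as (id, coordinates); produces O(n^2) records,
// matching the memory trade-off discussed above.
def pairwiseDistances(points: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  points.cartesian(points).map { case ((i, p), (j, q)) =>
    val d = math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)
    ((i, j), d)
  }
{code}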
[jira] [Created] (SPARK-5273) Improve documentation examples for LinearRegression
Dev Lakhani created SPARK-5273: -- Summary: Improve documentation examples for LinearRegression Key: SPARK-5273 URL: https://issues.apache.org/jira/browse/SPARK-5273 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Dev Lakhani Priority: Minor In the document https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html, under Linear least squares, Lasso, and ridge regression, the suggested usage of LinearRegressionWithSGD.train(), // Building the model val numIterations = 100 val model = LinearRegressionWithSGD.train(parsedData, numIterations) is not ideal even for simple examples such as y=x. It should be replaced with more realistic parameters that set a step size: val lr = new LinearRegressionWithSGD() lr.optimizer.setStepSize(0.0001) lr.optimizer.setNumIterations(100) or LinearRegressionWithSGD.train(input, 100, 0.0001) to produce a reasonable MSE. It took me a while on the dev forum to learn that the step size should be really small. This might save someone the same effort when learning MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
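For context, a fuller version of what the suggested documentation change could look like; the data path and parsing follow the existing example in the linear-methods guide, and the 0.0001 step size is the reporter's suggestion rather than an officially tuned value:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Assumes a SparkContext `sc`, as in the existing docs.
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model with an explicit (small) step size.
val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001)
lr.optimizer.setNumIterations(100)
val model = lr.run(parsedData)

// Or equivalently via the static helper: train(input, numIterations, stepSize).
val model2 = LinearRegressionWithSGD.train(parsedData, 100, 0.0001)
{code}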
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279251#comment-14279251 ] Joseph K. Bradley commented on SPARK-5012: -- [~MeethuMathew], [~tgaloppo] makes a good point. It might actually be best to make a Python API for MultivariateGaussian first, and then to do this JIRA. (Since we don't want to require scipy currently, we can't use the existing scipy.stats.multivariate_normal class.) Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5274) Stabilize UDFRegistration API
[ https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279352#comment-14279352 ] Apache Spark commented on SPARK-5274: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4056 Stabilize UDFRegistration API - Key: SPARK-5274 URL: https://issues.apache.org/jira/browse/SPARK-5274 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin 1. Removed UDFRegistration as a mixin in SQLContext and made it a field (udf). This removes 45 methods from SQLContext. 2. For Java UDFs, renamed dataType to returnType. 3. For Scala UDFs, added type tags. 4. Added all Java UDF registration methods to Scala's UDFRegistration. 5. Better documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5274) Stabilize UDFRegistration API
Reynold Xin created SPARK-5274: -- Summary: Stabilize UDFRegistration API Key: SPARK-5274 URL: https://issues.apache.org/jira/browse/SPARK-5274 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin 1. Removed UDFRegistration as a mixin in SQLContext and made it a field (udf). This removes 45 methods from SQLContext. 2. For Java UDFs, renamed dataType to returnType. 3. For Scala UDFs, added type tags. 4. Added all Java UDF registration methods to Scala's UDFRegistration. 5. Better documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
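For reference, a small example of what registration looks like once UDFRegistration is exposed as a udf field, based on the description above; the table name people is assumed to have been registered separately:

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`

// UDFs are registered via the `udf` field rather than methods mixed into SQLContext.
sqlContext.udf.register("strLen", (s: String) => s.length)

sqlContext.sql("SELECT strLen(name) FROM people")
{code}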
[jira] [Comment Edited] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279235#comment-14279235 ] Joseph K. Bradley edited comment on SPARK-5272 at 1/15/15 8:13 PM: --- My initial thoughts: (1) Are continuous labels/features important to support? In terms of when NB *should* be used, I believe they are important. People use Logistic Regression with continuous labels and features, and Naive Bayes is really the same type of model (just trained differently). * E.g.: Ng Jordan. On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. NIPS 2002. ** Theoretically, the 2 types of models have the same purpose, but they should be used in different regimes. In terms of when NB is actually used by Spark users, I'm not sure. Hopefully some research and discussion here will make that clearer. (2) What should the API look like? I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which use the same underlying implementation. That implementation should include a Factor concept encoding the type of distribution. This should be simple to do for Naive Bayes, and it will give some guidance if we move to support more general probabilistic graphical models in MLlib. was (Author: josephkb): My initial thoughts: (1) Are continuous labels/features important to support? In terms of when NB *should* be used, I believe they are important. People use Logistic Regression with continuous labels and features, and Naive Bayes is really the same type of model (just trained differently). * E.g.: Ng Jordan. On Discriminative vs. Generative classifiers: A comparison of logistic regression and naive Bayes. NIPS 2002. ** Theoretically, the 2 types of models have the same purpose, but they should be used in different regimes. (2) What should the API look like? I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which use the same underlying implementation. That implementation should include a Factor concept encoding the type of distribution. This should be simple to do for Naive Bayes, and it will give some guidance if we move to support more general probabilistic graphical models in MLlib. Refactor NaiveBayes to support discrete and continuous labels,features -- Key: SPARK-5272 URL: https://issues.apache.org/jira/browse/SPARK-5272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley This JIRA is to discuss refactoring NaiveBayes in order to support both discrete and continuous labels and features. Currently, NaiveBayes supports only discrete labels and features. Proposal: Generalize it to support continuous values as well. Some items to discuss are: * How commonly are continuous labels/features used in practice? (Is this necessary?) * What should the API look like? ** E.g., should NB have multiple classes for each type of label/feature, or should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279250#comment-14279250 ] RJ Nowling commented on SPARK-4894: --- Thanks, [~josephkb]! I'd be happy to help with the NB refactoring too :) Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
[ https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279258#comment-14279258 ] RJ Nowling commented on SPARK-5272: --- Hi [~josephkb], I can see benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain than multiple NB subclasses as in sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised, and provide a more advanced API for power users. Does this sound like I'm understanding you correctly? Re: Decision trees. Decision tree models generally support different types of features (categorical, binary, discrete, continuous). Does Spark's decision tree implementation support those different types? How are they handled? Do they abstract the feature type? I feel there could be common ground here. Refactor NaiveBayes to support discrete and continuous labels,features -- Key: SPARK-5272 URL: https://issues.apache.org/jira/browse/SPARK-5272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley This JIRA is to discuss refactoring NaiveBayes in order to support both discrete and continuous labels and features. Currently, NaiveBayes supports only discrete labels and features. Proposal: Generalize it to support continuous values as well. Some items to discuss are: * How commonly are continuous labels/features used in practice? (Is this necessary?) * What should the API look like? ** E.g., should NB have multiple classes for each type of label/feature, or should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
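Purely as an illustrative sketch of the composition idea discussed above (none of these types exist in MLlib), the per-feature likelihood could be modeled roughly like this, allowing heterogeneous feature types within one Naive Bayes model:

{code}
trait FeatureLikelihood extends Serializable {
  // log P(x | class c), given statistics gathered for class c during training
  def logLikelihood(x: Double, classIndex: Int): Double
}

class BernoulliLikelihood(logP: Array[Double], log1mP: Array[Double]) extends FeatureLikelihood {
  def logLikelihood(x: Double, c: Int): Double = if (x > 0.0) logP(c) else log1mP(c)
}

class GaussianLikelihood(mean: Array[Double], variance: Array[Double]) extends FeatureLikelihood {
  def logLikelihood(x: Double, c: Int): Double = {
    val diff = x - mean(c)
    -0.5 * (math.log(2 * math.Pi * variance(c)) + diff * diff / variance(c))
  }
}

// A model could then hold one FeatureLikelihood per feature and sum the
// log-likelihoods across features per class, mixing and matching types.
{code}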
[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279274#comment-14279274 ] Joseph K. Bradley edited comment on SPARK-1405 at 1/15/15 9:29 PM: --- I'll try out the statmt dataset if that will be easier for everyone to access. UPDATE: Note: The statmt dataset is an odd one since each document is a single sentence. I'll still try it since I could imagine a lot of users wanting to run LDA on tweets or other short documents, but I might continue with my previous tests first. was (Author: josephkb): I'll try out the statmt dataset if that will be easier for everyone to access. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from text corpus. Different with current machine learning algorithms in MLlib, instead of using optimization algorithms such as gradient desent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare a LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (import from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5224) parallelize list/ndarray is really slow
[ https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5224. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4024 [https://github.com/apache/spark/pull/4024] parallelize list/ndarray is really slow --- Key: SPARK-5224 URL: https://issues.apache.org/jira/browse/SPARK-5224 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The default batchSize changed to 0 (batching based on the size of objects), but parallelize() still uses BatchedSerializer with batchSize=1. Also, BatchedSerializer did not work well with list and numpy.ndarray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5224) parallelize list/ndarray is really slow
[ https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5224: -- Assignee: Davies Liu parallelize list/ndarray is really slow --- Key: SPARK-5224 URL: https://issues.apache.org/jira/browse/SPARK-5224 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The default batchSize changed to 0 (batching based on the size of objects), but parallelize() still uses BatchedSerializer with batchSize=1. Also, BatchedSerializer did not work well with list and numpy.ndarray. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features
Joseph K. Bradley created SPARK-5272: Summary: Refactor NaiveBayes to support discrete and continuous labels,features Key: SPARK-5272 URL: https://issues.apache.org/jira/browse/SPARK-5272 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley This JIRA is to discuss refactoring NaiveBayes in order to support both discrete and continuous labels and features. Currently, NaiveBayes supports only discrete labels and features. Proposal: Generalize it to support continuous values as well. Some items to discuss are: * How commonly are continuous labels/features used in practice? (Is this necessary?) * What should the API look like? ** E.g., should NB have multiple classes for each type of label/feature, or should it take a general Factor type parameter? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279241#comment-14279241 ] Joseph K. Bradley commented on SPARK-4894: -- [~rnowling] I too don't want to hold up the Bernoulli NB too much. I just made linked a JIRA per your suggestion [https://issues.apache.org/jira/browse/SPARK-5272]. I'll add my thoughts there (and feel free to copy yours there too). I'm not sure if we can reuse much from decision trees since they are not probabilistic models and have a different concept of loss or error. For now, generalizing the existing Naive Bayes class to handle the Bernoulli case sounds good. Thanks for taking the time to discuss this! Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279216#comment-14279216 ] Apache Spark commented on SPARK-5111: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/4064 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Reporter: Zhan Zhang Due to java.lang.NoSuchFieldError: SASL_PROPS error. Need to backport some hive-0.14 fix into spark, since there is no effort to upgrade hive to 0.14 support in spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279524#comment-14279524 ] Imran Rashid commented on SPARK-4746: - This doesn't work as well as I thought -- all of the junit tests get skipped. The problem is a mismatch between the way test args are handled by the junit test runner and the scalatest runner. I think our options are: 1) abandon a tag-based approach: just use directories / file names to separate out unit tests from integration tests 2) change all of our junit tests to scalatest. (it's perfectly fine to test Java code w/ scalatest.) 3) See if we can get scalatest to also run our junit tests 4) change the sbt task to first run scalatest, with all junit tests turned off, and then just run the junit tests, so that we can pass in different args to each one. 5) just live w/ the fact that the junit tests never match the tags so they are effectively considered integration tests. Note that junit has a notion similar to tags in categories: https://github.com/junit-team/junit/wiki/Categories The main problem here is the difference in the args for the two test runners. integration tests should be separated from faster unit tests Key: SPARK-4746 URL: https://issues.apache.org/jira/browse/SPARK-4746 Project: Spark Issue Type: Bug Reporter: Imran Rashid Priority: Trivial Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
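For reference, the scalatest side of the tag-based approach mentioned in the issue description would look roughly like this; the tag name and suite are illustrative only, and as the comment above notes, this does not by itself solve the junit-runner mismatch:

{code}
import org.scalatest.{FunSuite, Tag}

object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  test("fast unit-level check") {
    assert(1 + 1 === 2)
  }

  test("slow end-to-end check", IntegrationTest) {
    // long-running setup and assertions here
  }
}

// Excluding tagged tests when invoking the scalatest runner uses the -l flag,
// e.g. passing: -l org.apache.spark.tags.IntegrationTest
{code}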
[jira] [Resolved] (SPARK-5274) Stabilize UDFRegistration API
[ https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5274. Resolution: Fixed Fix Version/s: 1.3.0 Stabilize UDFRegistration API - Key: SPARK-5274 URL: https://issues.apache.org/jira/browse/SPARK-5274 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 1. Removed UDFRegistration as a mixin in SQLContext and made it a field (udf). This removes 45 methods from SQLContext. 2. For Java UDFs, renamed dataType to returnType. 3. For Scala UDFs, added type tags. 4. Added all Java UDF registration methods to Scala's UDFRegistration. 5. Better documentation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5144) spark-yarn module should be published
[ https://issues.apache.org/jira/browse/SPARK-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279457#comment-14279457 ] Matthew Sanders commented on SPARK-5144: +1 -- I am in a similar situation and would love to see this addressed somehow. spark-yarn module should be published - Key: SPARK-5144 URL: https://issues.apache.org/jira/browse/SPARK-5144 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Aniket Bhatnagar We disabled publishing of certain modules in SPARK-3452. One of such modules is spark-yarn. This breaks applications that submit spark jobs programatically with master set as yarn-client. This is because SparkContext is dependent on classes from yarn-client module to submit the YARN application. Here is the stack trace that you get if you submit the spark job without yarn-client dependency: 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - MemoryStore started with capacity 731.7 MB Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105) at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292) at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159) at org.apache.spark.SparkContext.init(SparkContext.scala:232) at com.myimpl.Server:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23) at scala.util.Success$$anonfun$map$1.apply(Try.scala:236) at scala.util.Try$.apply(Try.scala:191) at scala.util.Success.map(Try.scala:236) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235) at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.SparkException: Unable to load YARN support at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199) at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194) at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala) ... 27 more Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.yarn.YarnSparkHadoopUtil at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:190) at org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195) ... 29 more -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279406#comment-14279406 ] Josh Rosen commented on SPARK-4879: --- I'm not sure that SparkHadoopWriter's use of FileOutputCommitter properly obeys the OutputCommitter contracts in Hadoop. According to the [OutputCommitter Javadoc|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/OutputCommitter.html] {quote} The methods in this class can be called from several different processes and from several different contexts. It is important to know which process and which context each is called from. Each method should be marked accordingly in its documentation. It is also important to note that not all methods are guaranteed to be called once and only once. If a method is not guaranteed to have this property the output committer needs to handle this appropriately. Also note it will only be in rare situations where they may be called multiple times for the same task. {quote} Based on the documentation, `needsTaskCommit` is called from each individual task's process that will output to HDFS, and it is called just for that task., so it seems like it should be safe to call this from SparkHadoopWriter. However, maybe we're misusing the `commitTask` method: {quote} If needsTaskCommit(TaskAttemptContext) returns true and this task is the task that the AM determines finished first, this method is called to commit an individual task's output. This is to mark that tasks output as complete, as commitJob(JobContext) will also be called later on if the entire job finished successfully. This is called from a task's process. This may be called multiple times for the same task, but different task attempts. It should be very rare for this to be called multiple times and requires odd networking failures to make this happen. In the future the Hadoop framework may eliminate this race. {quote} I think that we're missing the this task is the task that the AM determines finished first part of the equation here. If `needsTaskCommit` is false, then we definitely shouldn't commit (e.g. if it's an original task that lost to a speculated copy), but if it's true then I don't think it's safe to commit; we need some central authority to pick a winner. Let's see how Hadoop does things, working backwards from actual calls of `commitTask` to see whether they're guarded by some coordination through the AM. It looks like `OutputCommitter` is part of the `mapred` API, so I'll only look at classes in that package: In `Task.java`, `committer.commitTask` is only performed after checking `canCommit` through `TaskUmbilicalProtocol`: https://github.com/apache/hadoop/blob/a655973e781caf662b360c96e0fa3f5a873cf676/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1185. According to the Javadocs for TaskAttemptListenerImpl.canCommit (the actual concrete implementation of this method): {code} /** * Child checking whether it can commit. * * br/ * Commit is a two-phased protocol. First the attempt informs the * ApplicationMaster that it is * {@link #commitPending(TaskAttemptID, TaskStatus)}. Then it repeatedly polls * the ApplicationMaster whether it {@link #canCommit(TaskAttemptID)} This is * a legacy from the centralized commit protocol handling by the JobTracker. 
*/ @Override public boolean canCommit(TaskAttemptID taskAttemptID) throws IOException { {code} This ends up delegating to `Task.canCommit()`: {code} /** * Can the output of the taskAttempt be committed. Note that once the task * gives a go for a commit, further canCommit requests from any other attempts * should return false. * * @param taskAttemptID * @return whether the attempt's output can be committed or not. */ boolean canCommit(TaskAttemptId taskAttemptID); {code} There's a bunch of tricky logic that involves communication with the AM (see AttemptCommitPendingTransition and the other transitions in TaskImpl), but it looks like the gist is that the winner is picked by the AM through some central coordination process. So, it looks like the right fix is to implement these same state transitions ourselves. It would be nice if there was a clean way to do this that could be easily backported to maintenance branches. Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical
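A sketch of the central-authority idea the comment arrives at for SPARK-4879: the driver records the first attempt that asks to commit each partition and rejects everyone else. This only illustrates the state Hadoop's AM keeps around canCommit(); the class and method names are made up here, not Spark's eventual implementation.
{code}
import scala.collection.mutable

// Hypothetical driver-side arbiter in the spirit of Hadoop's AM-based canCommit():
// the first attempt to ask for a partition becomes the authorized committer,
// later attempts (e.g. speculative copies) are denied.
class OutputCommitArbiter {
  private val authorized = mutable.Map.empty[Int, Long] // partitionId -> winning attemptId

  def canCommit(partitionId: Int, attemptId: Long): Boolean = synchronized {
    authorized.get(partitionId) match {
      case Some(winner) => winner == attemptId // only the recorded winner may (re)commit
      case None =>
        authorized(partitionId) = attemptId    // first asker wins
        true
    }
  }
}
{code}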
[jira] [Created] (SPARK-5275) pyspark.streaming is not included in assembly jar
Davies Liu created SPARK-5275: - Summary: pyspark.streaming is not included in assembly jar Key: SPARK-5275 URL: https://issues.apache.org/jira/browse/SPARK-5275 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker The pyspark.streaming is not included in assembly jar of spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5276) pyspark.streaming is not included in assembly jar
Davies Liu created SPARK-5276: - Summary: pyspark.streaming is not included in assembly jar Key: SPARK-5276 URL: https://issues.apache.org/jira/browse/SPARK-5276 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker The pyspark.streaming is not included in assembly jar of spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279274#comment-14279274 ] Joseph K. Bradley commented on SPARK-1405: -- I'll try out the statmt dataset if that will be easier for everyone to access. parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib - Key: SPARK-1405 URL: https://issues.apache.org/jira/browse/SPARK-1405 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xusen Yin Assignee: Guoqiang Li Priority: Critical Labels: features Attachments: performance_comparison.png Original Estimate: 336h Remaining Estimate: 336h Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts topics from a text corpus. Unlike the current machine learning algorithms in MLlib, which use optimization algorithms such as gradient descent, LDA uses expectation algorithms such as Gibbs sampling. In this PR, I prepare an LDA implementation based on Gibbs sampling, with a wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279602#comment-14279602 ] Imran Rashid commented on SPARK-3622: - In some ways this kinda reminds me of the problem w/ accumulators and lazy transformations. Accumulators are basically multiple output, but Spark itself provides no way to track when that output is ready. It's up to the developer to figure it out. If you do a transformation on {{rddA}} you've got to know to wait until you've also got a transformation on {{rddB}} ready as well. Probably the simplest case for this is filtering records by some condition, but keeping both the good and bad records, ala scala collection's {{partition}} method. I think this has come up on the user mailing list a few times. What about having some new type {{MultiRDD}}, which only runs when you've queued up an action on *all* RDDs? eg. something like:
{code}
val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str => MyRecordParser.isOk(str) }
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{ str => MyRecordParser.parse(str) }
val tmp: RDD[MyCaseClass] = parsed.map{f1}.filter{f2}.mapPartitions{f3} // still doesn't do anything
...
val result = tmp.reduce{reduceFunc} // now everything gets run
{code}
Provide a custom transformation that can output multiple RDDs - Key: SPARK-3622 URL: https://issues.apache.org/jira/browse/SPARK-3622 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Xuefu Zhang All existing transformations return just one RDD at most, even those which take user-supplied functions such as mapPartitions(). However, sometimes a user-provided function may need to output multiple RDDs. For instance, a filter function that divides the input RDD into several RDDs. While it's possible to get multiple RDDs by transforming the same RDD multiple times, it may be more efficient to do this concurrently in one shot, especially when the user's existing function is already generating different data sets. This is the case in Hive on Spark, where Hive's map function and reduce function can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
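For comparison, the workaround available with the current API, as a sketch: cache the input once and derive each output with its own filter pass. MyRecordParser stands in for the hypothetical parser from the comment above.
{code}
import org.apache.spark.rdd.RDD

// Stand-in for the MyRecordParser used in the example above.
object MyRecordParser {
  def isOk(str: String): Boolean = !str.trim.isEmpty
}

// Two-output "partition" with today's API: one cached scan, two filter passes.
// Each downstream action still triggers its own traversal of the cached data,
// which is exactly the inefficiency the proposal wants to avoid.
def splitGoodAndBad(input: RDD[String]): (RDD[String], RDD[String]) = {
  val cached = input.cache()
  val good = cached.filter(MyRecordParser.isOk)
  val bad  = cached.filter(s => !MyRecordParser.isOk(s))
  (good, bad)
}
{code}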
[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Seiden updated SPARK-5277: -- Remaining Estimate: (was: 24h) Original Estimate: (was: 24h) SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Max Seiden Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behavior depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during the exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
Max Seiden created SPARK-5277: - Summary: SparkSqlSerializer does not register user specified KryoRegistrators Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Max Seiden Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behavior depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during the exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
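A sketch of the direction SPARK-5277 points at, written against the public KryoSerializer rather than the actual SparkSqlSerializer code: start from super.newKryo(), which applies spark.kryo.registrator and the other user settings, and only then layer extra registrations on top. The class name and the example registration are illustrative.
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

// Illustrative serializer: keep the parent's Kryo (user registrators included)
// and add registrations, instead of building a fresh Kryo from scratch.
class SqlAwareKryoSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    val kryo = super.newKryo()                    // honors spark.kryo.registrator etc.
    kryo.register(classOf[java.math.BigDecimal])  // example of a SQL-specific addition
    kryo
  }
}
{code}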
[jira] [Closed] (SPARK-5011) Add support for WITH SERDEPROPERTIES, TBLPROPERTIES in CREATE TEMPORARY TABLE
[ https://issues.apache.org/jira/browse/SPARK-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengli closed SPARK-5011. -- Resolution: Later Add support for WITH SERDEPROPERTIES, TBLPROPERTIES in CREATE TEMPORARY TABLE - Key: SPARK-5011 URL: https://issues.apache.org/jira/browse/SPARK-5011 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0, 1.2.1 Reporter: shengli Priority: Minor Fix For: 1.2.1 Original Estimate: 96h Remaining Estimate: 96h For external datasource integration. We have two kinds of datasource: 1. File: like avro, json, parquet, etc. 2. Database: like hbase, cassandra, etc. For `File`, there are not too many configurations, so using the Options syntax is OK. But for `Database` we usually have many configurations at different levels, so we need to support the `WITH SERDEPROPERTIES` and `TBLPROPERTIES` syntax, like Hive's HBase integration: ``` CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz"); ``` Reference: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
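For contrast with the SPARK-5011 proposal, the flat Options syntax that already covers file-based sources, sketched through the SQL data sources API. The sqlContext in scope, the path, and the table name are placeholders added for illustration.
{code}
// Current CREATE TEMPORARY TABLE ... USING ... OPTIONS syntax; a single flat
// key/value list is enough for file sources, which is the point the description makes.
sqlContext.sql(
  """CREATE TEMPORARY TABLE people
    |USING org.apache.spark.sql.json
    |OPTIONS (path '/data/people.json')
  """.stripMargin)
{code}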
[jira] [Resolved] (SPARK-4857) Add Executor Events to SparkListener
[ https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4857. Resolution: Fixed Fix Version/s: 1.3.0 Add Executor Events to SparkListener Key: SPARK-4857 URL: https://issues.apache.org/jira/browse/SPARK-4857 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis Assignee: Kostas Sakellis Fix For: 1.3.0 We need to add events to the SparkListener to indicate an executor has been added or removed with corresponding information. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
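A sketch of how an application might consume the executor events SPARK-4857 adds to SparkListener; the event and field names follow what went into 1.3, but treat the exact shapes here as assumptions.
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Logs executor lifecycle changes as they are reported on the listener bus.
class ExecutorLifecycleLogger extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit =
    println(s"executor ${event.executorId} added on ${event.executorInfo.executorHost}")

  override def onExecutorRemoved(event: SparkListenerExecutorRemoved): Unit =
    println(s"executor ${event.executorId} removed: ${event.reason}")
}

// Registered on an existing context (sc assumed to be in scope):
// sc.addSparkListener(new ExecutorLifecycleLogger)
{code}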
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279711#comment-14279711 ] Apache Spark commented on SPARK-4879: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4066 Missing output partitions after job completes with speculative execution Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Attachments: speculation.txt, speculation2.txt When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing. h3. Reproduction This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}:
{code}
// Rig a job such that all but one of the tasks complete instantly
// and one task runs for 20 seconds on its first attempt and instantly
// on its second attempt:
val numTasks = 100
sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
  if (ctx.partitionId == 0) { // If this is the one task that should run really slow
    if (ctx.attemptId == 0) { // If this is the first attempt, run slow
      Thread.sleep(20 * 1000)
    }
  }
  iter
}.map(x => (x, x)).saveAsTextFile("/test4")
{code}
When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task: {code} [...]
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100) 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at console:22) finished in 0.856 s 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at console:22, took 0.885438374 s 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully scala 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160) org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172) org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132) org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991) org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails: {code} if (fs.isFile(taskOutput)) { 152 Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 153 getTempTaskOutputPath(context)); 154 if (!fs.rename(taskOutput, finalOutputPath)) {
[jira] [Commented] (SPARK-4874) Report number of records read/written in a task
[ https://issues.apache.org/jira/browse/SPARK-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279712#comment-14279712 ] Apache Spark commented on SPARK-4874: - User 'ksakellis' has created a pull request for this issue: https://github.com/apache/spark/pull/4067 Report number of records read/written in a task --- Key: SPARK-4874 URL: https://issues.apache.org/jira/browse/SPARK-4874 Project: Spark Issue Type: Improvement Reporter: Kostas Sakellis Assignee: Kostas Sakellis This metric will help us find key skew using the WebUI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model
[ https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279811#comment-14279811 ] Meethu Mathew commented on SPARK-5012: -- Once SPARK-5019 is resolved, we will make the changes accordingly. Thanks [~josephkb] and [~tgaloppo] for the comments. Python API for Gaussian Mixture Model - Key: SPARK-5012 URL: https://issues.apache.org/jira/browse/SPARK-5012 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Meethu Mathew Add Python API for the Scala implementation of GMM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
Wenchen Fan created SPARK-5278: -- Summary: ambiguous reference to fields in Spark SQL is incompleted Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan for json string like {a:[ { b: 1, B: 2 } } The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a:[ { b: 1, B: 2 }] } The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Description: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` was: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a:[{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Description: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a:[{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` was: for json string like {a:[ { b: 1, B: 2 } } The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a:[ { b: 1, B: 2 }] } The SQL `SELECT a[0].b from t` will pass and pick the first `b` ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a:[{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect
[ https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2630. Resolution: Duplicate I think this is a dup of SPARK-4092. Input data size of CoalescedRDD is incorrect Key: SPARK-2630 URL: https://issues.apache.org/jira/browse/SPARK-2630 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0, 1.0.1 Reporter: Davies Liu Assignee: Andrew Ash Priority: Blocker Attachments: overflow.tiff Given one big file, such as text.4.3G, put into one task: {code} sc.textFile("text.4.3G").coalesce(1).count() {code} In the Spark Web UI, you will see that the input size is 5.4M. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4955: --- Priority: Blocker (was: Critical) Dynamic allocation doesn't work in YARN cluster mode Key: SPARK-4955 URL: https://issues.apache.org/jira/browse/SPARK-4955 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Chengxiang Li Assignee: Lianhui Wang Priority: Blocker With executor dynamic scaling enabled in yarn-cluster mode, after a query finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, the executor count is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4955: --- Target Version/s: 1.3.0 Dynamic allocation doesn't work in YARN cluster mode Key: SPARK-4955 URL: https://issues.apache.org/jira/browse/SPARK-4955 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Chengxiang Li Assignee: Lianhui Wang Priority: Blocker With executor dynamic scaling enabled in yarn-cluster mode, after a query finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, the executor count is not reduced to the configured minimum. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.
[ https://issues.apache.org/jira/browse/SPARK-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279863#comment-14279863 ] Patrick Wendell commented on SPARK-5216: This has been proposed before, but in the past we decided not to do it. Trying to extrapolate the finish time of a stage accurately is basically impossible since in many workloads stragglers dominate the total response time. The conclusion was that it was better to give no estimate rather than one which is likely to be misleading. Spark Ui should report estimated time remaining for each stage. --- Key: SPARK-5216 URL: https://issues.apache.org/jira/browse/SPARK-5216 Project: Spark Issue Type: Wish Components: Spark Core, Web UI Affects Versions: 1.3.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Per-stage feedback on estimated remaining time can help the user get a grasp on how much time the job is going to take. This will only require changes on the UI/JobProgressListener side of the code since we already have most of the information needed. In the initial cut, the plan is to estimate time based on statistics of the running job, i.e. the average time taken by each task and the number of tasks per stage. This makes the most sense when jobs are long. If that works out, more heuristics can be added, like the projected time saved if the RDD is cached, and so on. More precise details will come as this evolves. In the meantime, thoughts on alternative approaches and suggestions on usefulness are welcome. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5279) Use java.math.BigDecimal as the exposed Decimal type
Reynold Xin created SPARK-5279: -- Summary: Use java.math.BigDecimal as the exposed Decimal type Key: SPARK-5279 URL: https://issues.apache.org/jira/browse/SPARK-5279 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Change it from scala.BigDecimal to java.math.BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Description: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` was: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [{b: 1, B: 2}]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster
[ https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5176: --- Labels: starter (was: ) Thrift server fails with confusing error message when deploy-mode is cluster Key: SPARK-5176 URL: https://issues.apache.org/jira/browse/SPARK-5176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Tom Panning Labels: starter With Spark 1.2.0, when I try to run {noformat} $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 {noformat} The log output is {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal Jar url 'spark-internal' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar) Usage: DriverClient [options] launch active-master jar-url main-class [driver options] Usage: DriverClient kill active-master driver-id Options: -c CORES, --cores CORESNumber of cores to request (default: 1) -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512) -s, --superviseWhether to restart the driver on failure -v, --verbose Print more debugging output Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties {noformat} I do not get this error if deploy-mode is set to client. The --deploy-mode option is described by the --help output, so I expected it to work. I checked, and this behavior seems to be present in Spark 1.1.0 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster
[ https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279869#comment-14279869 ] Patrick Wendell commented on SPARK-5176: Yes, we should add a check here similar to the existing ones for the thriftserver class: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143 Thrift server fails with confusing error message when deploy-mode is cluster Key: SPARK-5176 URL: https://issues.apache.org/jira/browse/SPARK-5176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Tom Panning Labels: starter With Spark 1.2.0, when I try to run {noformat} $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 {noformat} The log output is {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal Jar url 'spark-internal' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar) Usage: DriverClient [options] launch active-master jar-url main-class [driver options] Usage: DriverClient kill active-master driver-id Options: -c CORES, --cores CORESNumber of cores to request (default: 1) -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512) -s, --superviseWhether to restart the driver on failure -v, --verbose Print more debugging output Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties {noformat} I do not get this error if deploy-mode is set to client. The --deploy-mode option is described by the --help output, so I expected it to work. I checked, and this behavior seems to be present in Spark 1.1.0 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
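Roughly what the check suggested for SPARK-5176 could look like, written as a standalone sketch of the logic rather than the actual SparkSubmit patch; the function name and the error-reporting call are assumptions here, following the style of the checks linked above.
{code}
// Standalone sketch of the guard: given the user-supplied main class and
// deploy mode, reject the unsupported combination with a clear message
// instead of letting DriverClient choke on "spark-internal".
def validateSubmitArguments(mainClass: String, deployMode: String): Unit = {
  val isThriftServer =
    mainClass == "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"
  if (isThriftServer && deployMode == "cluster") {
    sys.error("Cluster deploy mode is currently not supported for the Spark SQL Thrift server.")
  }
}
{code}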
[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Description: at hive context for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` was: for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan at hive context for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster
[ https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279869#comment-14279869 ] Patrick Wendell edited comment on SPARK-5176 at 1/16/15 6:28 AM: - Yes, we should add a check here similar to the existing ones for the thriftserver class: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143 [~tpanning] are you interested in contributing this? If not, someone else will pick it up. was (Author: pwendell): Yes, we should add a check here similar to the existing ones for the thriftserver class: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143 Thrift server fails with confusing error message when deploy-mode is cluster Key: SPARK-5176 URL: https://issues.apache.org/jira/browse/SPARK-5176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Tom Panning Labels: starter With Spark 1.2.0, when I try to run {noformat} $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 {noformat} The log output is {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal Jar url 'spark-internal' is not in valid format. Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, file:///XX.jar) Usage: DriverClient [options] launch active-master jar-url main-class [driver options] Usage: DriverClient kill active-master driver-id Options: -c CORES, --cores CORESNumber of cores to request (default: 1) -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512) -s, --superviseWhether to restart the driver on failure -v, --verbose Print more debugging output Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties {noformat} I do not get this error if deploy-mode is set to client. The --deploy-mode option is described by the --help output, so I expected it to work. I checked, and this behavior seems to be present in Spark 1.1.0 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-5278: --- Description: at hive context for json string like {code}{a: {b: 1, B: 2}}{code} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {code}{a: [{b: 1, B: 2}]}{code} The SQL `SELECT a[0].b from t` will pass and pick the first `b` was: at hive context for json string like {a: {b: 1, B: 2}} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {a: [ {b: 1, B: 2} ]} The SQL `SELECT a[0].b from t` will pass and pick the first `b` ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan at hive context for json string like {code}{a: {b: 1, B: 2}}{code} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {code}{a: [{b: 1, B: 2}]}{code} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
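The SPARK-5278 behavior spelled out as a small HiveContext session; the table names, the jsonRDD plumbing, and the sc in scope are added here for illustration.
{code}
import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc) // sc: an existing SparkContext

// Struct case: the ambiguous `b` vs `B` reference is detected and rejected.
hc.jsonRDD(sc.parallelize("""{"a": {"b": 1, "B": 2}}""" :: Nil)).registerTempTable("t")
hc.sql("SELECT a.b FROM t").collect()     // fails with the ambiguous-reference error

// Array-of-struct case: the same ambiguity slips through.
hc.jsonRDD(sc.parallelize("""{"a": [{"b": 1, "B": 2}]}""" :: Nil)).registerTempTable("t2")
hc.sql("SELECT a[0].b FROM t2").collect() // silently picks the first `b`
{code}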
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279883#comment-14279883 ] Yin Huai commented on SPARK-5260: - [~sonixbp] If you like, you can make the change and create a pull request. I can help you on that. btw, just a note. We do not add fix version(s) until it has been merged into our code base. Expose JsonRDD.allKeysWithValueTypes() in a utility class -- Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Fix For: 1.3.0 I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed json. For now, I've actually copied the method right out of the JsonRDD class into my own project but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else- like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted
[ https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279892#comment-14279892 ] Apache Spark commented on SPARK-5278: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/4068 ambiguous reference to fields in Spark SQL is incompleted - Key: SPARK-5278 URL: https://issues.apache.org/jira/browse/SPARK-5278 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan at hive context for json string like {code}{a: {b: 1, B: 2}}{code} The SQL `SELECT a.b from t` will report error for ambiguous reference to fields. But for json string like {code}{a: [{b: 1, B: 2}]}{code} The SQL `SELECT a[0].b from t` will pass and pick the first `b` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5251) Using `tableIdentifier` in hive metastore
[ https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5251: --- Target Version/s: 1.3.0 Using `tableIdentifier` in hive metastore -- Key: SPARK-5251 URL: https://issues.apache.org/jira/browse/SPARK-5251 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Using `tableIdentifier` in hive metastore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5251) Using `tableIdentifier` in hive metastore
[ https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-5251: --- Target Version/s: (was: 1.3.0) Using `tableIdentifier` in hive metastore -- Key: SPARK-5251 URL: https://issues.apache.org/jira/browse/SPARK-5251 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: wangfei Using `tableIdentifier` in hive metastore -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279914#comment-14279914 ] Reynold Xin commented on SPARK-2686: Do you mind closing the pull request? I will reopen the ticket. Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin reopened SPARK-2686: Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279913#comment-14279913 ] Reynold Xin commented on SPARK-2686: [~javadba] I think Michael meant closing the pull request, but not the ticket ... Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4867) UDF clean up
[ https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-4867: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5166 UDF clean up Key: SPARK-4867 URL: https://issues.apache.org/jira/browse/SPARK-4867 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker Right now our support and internal implementation of many functions have a few issues. Specifically: - UDFs don't know their input types and thus don't do type coercion. - We hard-code a bunch of built-in functions into the parser. This is bad because in SQL it creates new reserved words for things that aren't actually keywords. Also it means that for each function we need to add support to both SQLContext and HiveContext separately. For this JIRA I propose we do the following: - Change the interfaces for registerFunction and ScalaUdf to include types for the input arguments as well as the output type. - Add a rule to analysis that does type coercion for UDFs. - Add a parse rule for functions to SQLParser. - Rewrite all the UDFs that are currently hacked into the various parsers using this new functionality. Depending on how big this refactoring becomes, we could split parts 1 and 2 from part 3 above. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5211. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Yin Huai Restore HiveMetastoreTypes.toDataType - Key: SPARK-5211 URL: https://issues.apache.org/jira/browse/SPARK-5211 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Fix For: 1.3.0 It was a public API. Since developers are using it, we need to get it back. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279920#comment-14279920 ] Stephen Boesch commented on SPARK-2686: --- ok closed Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL
[ https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279923#comment-14279923 ] Reynold Xin commented on SPARK-2686: Thanks. Let's pull it in once SPARK-4867 is fixed. Add Length support to Spark SQL and HQL and Strlen support to SQL - Key: SPARK-2686 URL: https://issues.apache.org/jira/browse/SPARK-2686 Project: Spark Issue Type: Improvement Components: SQL Environment: all Reporter: Stephen Boesch Priority: Minor Labels: hql, length, sql Original Estimate: 0h Remaining Estimate: 0h Syntactic, parsing, and operational support have been added for LEN(GTH) and STRLEN functions. Examples: SQL: import org.apache.spark.sql._ case class TestData(key: Int, value: String) val sqlc = new SQLContext(sc) import sqlc._ val testData: SchemaRDD = sqlc.sparkContext.parallelize( (1 to 100).map(i = TestData(i, i.toString))) testData.registerAsTable(testData) sqlc.sql(select length(key) as key_len from testData order by key_len desc limit 5).collect res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2]) HQL: val hc = new org.apache.spark.sql.hive.HiveContext(sc) import hc._ hc.hql hql(select length(grp) from simplex).collect res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6]) As far as codebase changes: they have been purposefully made similar to the ones made for for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main classes changed. The testing suites affected are ConstantFolding and ExpressionEvaluation. In addition some ad-hoc testing was done as shown in the examples. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters
Adrian Wang created SPARK-5262: -- Summary: coalesce should allow NullType and 1 another type in parameters Key: SPARK-5262 URL: https://issues.apache.org/jira/browse/SPARK-5262 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Currently Coalesce(null, 1, null) would throw exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters
[ https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278416#comment-14278416 ] Apache Spark commented on SPARK-5262: - User 'adrian-wang' has created a pull request for this issue: https://github.com/apache/spark/pull/4057 coalesce should allow NullType and 1 another type in parameters --- Key: SPARK-5262 URL: https://issues.apache.org/jira/browse/SPARK-5262 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Currently Coalesce(null, 1, null) would throw exceptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
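The intended behavior can be illustrated with a small, hedged sketch of a type-resolution helper: NullType should be compatible with exactly one other argument type, so COALESCE(NULL, 1, NULL) resolves to the integer type instead of throwing. The DataType names below mirror Spark SQL's but are local stand-ins, not the project's actual classes.

{code}
// Local stand-ins for illustration; Spark SQL's real type classes are not used here.
sealed trait DataType
case object NullType extends DataType
case object IntegerType extends DataType
case object StringType extends DataType

def coalesceResultType(argTypes: Seq[DataType]): Option[DataType] = {
  val nonNull = argTypes.filterNot(_ == NullType).distinct
  nonNull match {
    case Seq()  => Some(NullType)  // COALESCE(NULL, NULL, ...) stays null-typed
    case Seq(t) => Some(t)         // NullType plus one other type: allowed, e.g. Coalesce(null, 1, null)
    case _      => None            // genuinely mixed types need separate coercion rules
  }
}

// coalesceResultType(Seq(NullType, IntegerType, NullType))  // Some(IntegerType)
{code}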
[jira] [Updated] (SPARK-1084) Fix most build warnings
[ https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1084: -- Reporter: Sean Owen (was: Sean Owen) Fix most build warnings --- Key: SPARK-1084 URL: https://issues.apache.org/jira/browse/SPARK-1084 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Labels: mvn, sbt, warning Fix For: 1.0.0 I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of the warnings that appear during build, so that developers don't become accustomed to them. The accompanying pull request contains a number of commits to quash most warnings observed through the mvn and sbt builds, although not all of them. FIXED! [WARNING] Parameter tasks is deprecated, use target instead Just a matter of updating tasks -> target in inline Ant scripts. WARNING: -p has been deprecated and will be reused for a different (but still very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R. Goes away with updating scalatest plugin -> 1.0-RC2 [WARNING] Note: /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java uses unchecked or unsafe operations. [WARNING] Note: Recompile with -Xlint:unchecked for details. Mostly @SuppressWarnings("unchecked") but needed a few more things to reveal the warning source: <fork>true</fork> (also needed for maxmem) and version 3.1 of the plugin. In a few cases some declaration changes were appropriate to avoid warnings. /Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25: warning: Could not find any member to link for akka.actor.ActorSystem. /** ^ Getting several scaladoc errors like this and I'm not clear why it can't find the type -- outside its module? Remove the links as they're evidently not linking anyway? /Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86: warning: Variable eval undefined in comment for class SparkIMain in class SparkIMain $ has to be escaped as \$ in scaladoc, apparently [WARNING] 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId' for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25 This one might need review. This is valid Maven syntax, but Maven still warns on it. I wanted to see if we can do without it. These are trying to exclude: - org.codehaus.jackson - org.sonatype.sisu.inject - org.xerial.snappy org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. org.xerial.snappy is used by dependencies but the version seems to match anyway (1.0.5). org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming wants 1.9.11 directly. But the exclusion is in the wrong place if so, since Spark depends straight on Avro, which is what brings in 1.8.8, still. (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but the other Hadoop modules don't.) HBase depends on 1.8.8, but I figured it was intentional to leave that as it would not collide with Spark streaming. (?) (I understand this varies by Hadoop version but confirmed this is all the same for 1.0.4, 0.23.7, 2.2.0.) NOT FIXED. 
[warn] /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305: method connect in class IOManager is deprecated: use the new implementation in package akka.io instead [warn] override def preStart = IOManager(context.system).connect(new InetSocketAddress(port)) Not confident enough to fix this. [WARNING] there were 6 feature warning(s); re-run with -feature for details Don't know enough Scala to address these, yet. [WARNING] We have a duplicate org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar Probably addressable by being more careful about how binaries are packed, though this appears to be ignorable; two identical copies of the class are colliding. [WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile and [WARNING] JAR will be empty - no content was marked for inclusion! Apparently harmless warnings, but I don't know how to disable them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Updated] (SPARK-1181) 'mvn test' fails out of the box since sbt assembly does not necessarily exist
[ https://issues.apache.org/jira/browse/SPARK-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1181: -- Reporter: Sean Owen (was: Sean Owen) 'mvn test' fails out of the box since sbt assembly does not necessarily exist - Key: SPARK-1181 URL: https://issues.apache.org/jira/browse/SPARK-1181 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Labels: assembly, maven, sbt, test The test suite requires that sbt assembly has been run in order for some tests (like DriverSuite) to pass. The tests themselves say as much. This means that a mvn test from a fresh clone fails. There's a pretty simple fix, to have Maven's test-compile phase invoke sbt assembly. I suppose the only downside is re-invoking sbt assembly each time tests are run. I'm open to ideas about how to set this up more intelligently but it would be a generally good thing if the Maven build's tests passed out of the box. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1315) spark on yarn-alpha with mvn on master branch won't build
[ https://issues.apache.org/jira/browse/SPARK-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1315: -- Assignee: Sean Owen (was: Sean Owen) spark on yarn-alpha with mvn on master branch won't build - Key: SPARK-1315 URL: https://issues.apache.org/jira/browse/SPARK-1315 Project: Spark Issue Type: Bug Components: Build Reporter: Thomas Graves Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 I try to build off the master branch using Maven to build yarn-alpha but get the following errors. mvn -Dyarn.version=0.23.10 -Dhadoop.version=0.23.10 -Pyarn-alpha clean package -DskipTests - [ERROR] /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:25: object runtime is not a member of package reflect [ERROR] import scala.reflect.runtime.universe.runtimeMirror [ERROR] ^ [ERROR] /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:40: not found: value runtimeMirror [ERROR] private val mirror = runtimeMirror(classLoader) [ERROR]^ [ERROR] /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:92: object tools is not a member of package scala [ERROR] scala.tools.nsc.io.File(".mima-excludes"). [ERROR] ^ [ERROR] three errors found -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2879) Use HTTPS to access Maven Central and other repos
[ https://issues.apache.org/jira/browse/SPARK-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-2879: -- Assignee: Sean Owen (was: Sean Owen) Use HTTPS to access Maven Central and other repos - Key: SPARK-2879 URL: https://issues.apache.org/jira/browse/SPARK-2879 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 Maven Central has just now enabled HTTPS access for everyone to Maven Central (http://central.sonatype.org/articles/2014/Aug/03/https-support-launching-now/) This is timely, as a reminder of how easily an attacker can slip malicious code into a build that's downloading artifacts over HTTP (http://blog.ontoillogical.com/blog/2014/07/28/how-to-take-over-any-java-developer/). In the meantime, it looks like the Spring repo also now supports HTTPS, so can be used this way too. I propose to use HTTPS to access these repos. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-3803: -- Assignee: Sean Owen (was: Sean Owen) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents Key: SPARK-3803 URL: https://issues.apache.org/jira/browse/SPARK-3803 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.0 Reporter: Masaru Dobashi Assignee: Sean Owen Fix For: 1.2.0 When I executed computePrincipalComponents method of RowMatrix, I got java.lang.ArrayIndexOutOfBoundsException. {code} 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114) org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144) scala.collection.Iterator$class.foreach(Iterator.scala:727) scala.collection.AbstractIterator.foreach(Iterator.scala:1157) scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144) scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157) scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201) scala.collection.AbstractIterator.aggregate(Iterator.scala:1157) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:54) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) {code} The RowMatrix instance was generated from the result of TF-IDF like the following. 
{code} scala> val hashingTF = new HashingTF() scala> val tf = hashingTF.transform(texts) scala> import org.apache.spark.mllib.feature.IDF scala> tf.cache() scala> val idf = new IDF().fit(tf) scala> val tfidf: RDD[Vector] = idf.transform(tf) scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix scala> val mat = new RowMatrix(tfidf) scala> val pc = mat.computePrincipalComponents(2) {code} I think this was because I created the HashingTF instance with the default numFeatures, and an Array is used in the RowMatrix#computeGramianMatrix method like the following. {code} /** * Computes the Gramian matrix `A^T A`. */ def computeGramianMatrix(): Matrix = { val n = numCols().toInt val nt: Int = n * (n + 1) / 2 // Compute the upper triangular part of the gram matrix. val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))( seqOp = (U, v) => { RowMatrix.dspr(1.0, v, U.data) U }, combOp = (U1, U2) => U1 += U2) RowMatrix.triuToFull(n, GU.data) } {code} When the size of the Vectors generated by TF-IDF is too large, it makes nt have an undesirable value (and an undesirable size for the Array used in treeAggregate), since n * (n + 1) / 2 exceeds Int.MaxValue. Is this surmise correct? And, of
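The arithmetic behind that surmise can be checked directly. Assuming HashingTF's default of 2^20 features (the exact default may differ by version), n * (n + 1) / 2 overflows Int, so the backing Array is allocated far too small and any larger index fails, consistent with the ArrayIndexOutOfBoundsException above.

{code}
// Illustration of the suspected overflow (assumes the 2^20 default feature count).
val n = 1 << 20                             // 1,048,576 columns
val neededEntries = n.toLong * (n + 1) / 2  // 549,756,338,176 upper-triangular entries actually needed
val ntAsInt = n * (n + 1) / 2               // Int arithmetic wraps around to 524,288
println(s"needed=$neededEntries allocated=$ntAsInt Int.MaxValue=${Int.MaxValue}")
// The Gramian Array is sized with ntAsInt, so an index like 4878161 (seen in the stack trace)
// immediately throws ArrayIndexOutOfBoundsException.
{code}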
[jira] [Updated] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-2749: -- Assignee: Sean Owen (was: Sean Owen) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command vs master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However this module doesn't depend on com.novocode:junit-interface Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions
[ https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1556: -- Assignee: Sean Owen (was: Sean Owen) jets3t dep doesn't update properly with newer Hadoop versions - Key: SPARK-1556 URL: https://issues.apache.org/jira/browse/SPARK-1556 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.1, 0.9.0, 1.0.0 Reporter: Nan Zhu Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 In Hadoop 2.2.x or newer, Jet3st 0.9.0 which defines S3ServiceException/ServiceException is introduced, however, Spark still relies on Jet3st 0.7.x which has no definition of these classes What I met is that [code] 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException at org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280) at org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:205) at org.apache.spark.SparkContext.runJob(SparkContext.scala:891) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692) at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574) at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900) at $iwC$$iwC$$iwC$$iwC.init(console:15) at $iwC$$iwC$$iwC.init(console:20) at $iwC$$iwC.init(console:22) at $iwC.init(console:24) at init(console:26) at .init(console:30) at .clinit(console) at 
.init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793) at
[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j
[ https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1071: -- Reporter: Sean Owen (was: Sean Owen) Tidy logging strategy and use of log4j -- Key: SPARK-1071 URL: https://issues.apache.org/jira/browse/SPARK-1071 Project: Spark Issue Type: Improvement Components: Build, Input/Output Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.0.0 Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger. Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j. This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should: - Exclude dependencies on log4j, slf4j-log4j12 from Spark - Include dependency on log4j-over-slf4j - Include dependency on another logger X, and another slf4j-X - Recreate any log config that Spark does, that is needed, in the other logger's config That sounds about right. Here are the key changes: - Include the jcl-over-slf4j shim everywhere by depending on it in core. - Exclude dependencies on commons-logging from third-party libraries. - Include the jul-to-slf4j shim everywhere by depending on it in core. - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests And minor/incidental changes: - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala) - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me) Pull request coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
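As a small illustration of the "use SLF4J directly when logging" part of the proposal (illustrative only, not Spark's own Logging trait): code logs against the SLF4J API, and whichever binding is on the classpath (slf4j-log4j12 for Spark itself, or log4j-over-slf4j plus another backend downstream) decides where the output goes.

{code}
import org.slf4j.{Logger, LoggerFactory}

// Log only against the SLF4J API; the concrete backend is chosen by the binding on the classpath.
class MyComponent {
  private val log: Logger = LoggerFactory.getLogger(classOf[MyComponent])

  def run(): Unit = {
    log.info("starting work")
    log.debug("details go to whatever backend the deployment wired in")
  }
}
{code}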
[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds
[ https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1254: -- Reporter: Sean Owen (was: Sean Owen) Consolidate, order, and harmonize repository declarations in Maven/SBT builds - Key: SPARK-1254 URL: https://issues.apache.org/jira/browse/SPARK-1254 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.0.0 This suggestion addresses a few minor suboptimalities with how repositories are handled. 1) Use HTTPS consistently to access repos, instead of HTTP 2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.) 3) Update SBT build to match Maven build in this regard 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build
[ https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1335: -- Reporter: Sean Owen (was: Sean Owen) Also increase perm gen / code cache for scalatest when invoked via Maven build -- Key: SPARK-1335 URL: https://issues.apache.org/jira/browse/SPARK-1335 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Fix For: 1.0.0 I am observing build failures when the Maven build reaches tests in the new SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual complaint from scala, that it's out of permgen space, or that JIT out of code cache space. I see that various build scripts increase these both for SBT. This change simply adds these settings to scalatest's arguments. Works for me and seems a bit more consistent. (In the PR I'm going to tack on some other little changes too -- see PR.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1316) Remove use of Commons IO
[ https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1316: -- Reporter: Sean Owen (was: Sean Owen) Remove use of Commons IO Key: SPARK-1316 URL: https://issues.apache.org/jira/browse/SPARK-1316 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (This follows from a side point on SPARK-1133, in discussion of the PR: https://github.com/apache/spark/pull/164 ) Commons IO is barely used in the project, and can easily be replaced with equivalent calls to Guava or the existing Spark Utils.scala class. Removing a dependency feels good, and this one in particular can get a little problematic since Hadoop uses it too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
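A hedged example of the kind of one-for-one replacement this implies (not a list of the actual call sites changed): a Commons IO helper swapped for its Guava equivalent, which Spark already pulls in.

{code}
import java.io.File
import com.google.common.base.Charsets
import com.google.common.io.Files

// Before (Commons IO): org.apache.commons.io.FileUtils.readFileToString(file)
// After (Guava), dropping the extra dependency:
def readFileAsString(file: File): String =
  Files.toString(file, Charsets.UTF_8)
{code}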
[jira] [Updated] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-2341: -- Assignee: Sean Owen (was: Sean Owen) loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Assignee: Sean Owen Priority: Minor Labels: easyfix Fix For: 1.1.0 Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode : each target value is interpreted as a class name ! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
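A minimal sketch of the proposed parser, assuming the LabelParser abstraction mentioned above boils down to a single parse(String): Double method (the real MLlib trait may differ slightly): for regression data the label is already the target value, so it is converted rather than mapped to a class index.

{code}
// Simplified local trait standing in for MLlib's LabelParser.
trait LabelParser {
  def parse(labelString: String): Double
}

object RegressionLabelParser extends LabelParser {
  // Keep the raw target value from the LibSVM line, e.g. "3.7 1:0.5 2:1.2" -> 3.7
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}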
[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j
[ https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1071: -- Assignee: Sean Owen (was: Sean Owen) Tidy logging strategy and use of log4j -- Key: SPARK-1071 URL: https://issues.apache.org/jira/browse/SPARK-1071 Project: Spark Issue Type: Improvement Components: Build, Input/Output Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.0.0 Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger. Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j. This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should: - Exclude dependencies on log4j, slf4j-log4j12 from Spark - Include dependency on log4j-over-slf4j - Include dependency on another logger X, and another slf4j-X - Recreate any log config that Spark does, that is needed, in the other logger's config That sounds about right. Here are the key changes: - Include the jcl-over-slf4j shim everywhere by depending on it in core. - Exclude dependencies on commons-logging from third-party libraries. - Include the jul-to-slf4j shim everywhere by depending on it in core. - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests And minor/incidental changes: - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala) - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me) Pull request coming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2798) Correct several small errors in Flume module pom.xml files
[ https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-2798: -- Assignee: Sean Owen (was: Sean Owen) Correct several small errors in Flume module pom.xml files -- Key: SPARK-2798 URL: https://issues.apache.org/jira/browse/SPARK-2798 Project: Spark Issue Type: Bug Components: Build Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 (EDIT) Since the scalatest issue was since resolved, this is now about a few small problems in the Flume Sink pom.xml - scalatest is not declared as a test-scope dependency - Its Avro version doesn't match the rest of the build - Its Flume version is not synced with the other Flume module - The other Flume module declares its dependency on Flume Sink slightly incorrectly, hard-coding the Scala 2.10 version - It depends on Scala Lang directly, which it shouldn't -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5263) `create table` DDL need to check if table exists first
shengli created SPARK-5263: -- Summary: `create table` DDL need to check if table exists first Key: SPARK-5263 URL: https://issues.apache.org/jira/browse/SPARK-5263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 `create table` DDL need to check if table exists first -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
[ https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Grigor updated SPARK-5246: --- Description: ##How to reproduce: 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to setup VPC for this bug. After you followed that guide, start new instance in VPC, ssh to it (though NAT server) 2) user starts a cluster in VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out failed to launch org.apache.spark.deploy.master.Master: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker: 10.1.1.62: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) 10.1.1.62: ... 12 more 10.1.1.62: full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out [timing] spark-standalone setup: 00h 00m 28s (omitted for brevity) {code} /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 8080 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, HUP, INT] Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: ip-10-1-1-151: Name or service not known at java.net.InetAddress.getLocalHost(InetAddress.java:1473) at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620) at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613) at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.util.Utils$.localHostName(Utils.scala:665) at org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27) at org.apache.spark.deploy.master.Master$.main(Master.scala:819) at org.apache.spark.deploy.master.Master.main(Master.scala) Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not known at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at 
java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more {code} Problem is that instance launched in VPC may be not able to resolve own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092. I am going to submit a fix for this problem since I need this functionality asap. ## How to reproduce was: How to reproduce: 1) user starts a cluster in VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out failed to launch org.apache.spark.deploy.master.Master: at
[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve
[ https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Grigor updated SPARK-5246: --- Description: How to reproduce: 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to setup VPC for this bug. After you followed that guide, start new instance in VPC, ssh to it (though NAT server) 2) user starts a cluster in VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting org.apache.spark.deploy.master.Master, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out failed to launch org.apache.spark.deploy.master.Master: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker: 10.1.1.62: at java.net.InetAddress.getLocalHost(InetAddress.java:1469) 10.1.1.62: ... 12 more 10.1.1.62: full log in /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out [timing] spark-standalone setup: 00h 00m 28s (omitted for brevity) {code} /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 8080 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, HUP, INT] Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: ip-10-1-1-151: Name or service not known at java.net.InetAddress.getLocalHost(InetAddress.java:1473) at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620) at org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612) at org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613) at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.util.Utils$.localHostName(Utils.scala:665) at org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27) at org.apache.spark.deploy.master.Master$.main(Master.scala:819) at org.apache.spark.deploy.master.Master.main(Master.scala) Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not known at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method) at 
java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901) at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293) at java.net.InetAddress.getLocalHost(InetAddress.java:1469) ... 12 more {code} Problem is that instance launched in VPC may be not able to resolve own local hostname. Please see https://forums.aws.amazon.com/thread.jspa?threadID=92092. I am going to submit a fix for this problem since I need this functionality asap. was: ##How to reproduce: 1) http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html should be sufficient to setup VPC for this bug. After you followed that guide, start new instance in VPC, ssh to it (though NAT server) 2) user starts a cluster in VPC: {code} ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 --subnet-id=subnet-2571dd4d --zone=eu-west-1a launch SparkByScript Setting up security groups... (omitted for brevity) 10.1.1.62 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop no org.apache.spark.deploy.master.Master to stop starting
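The failure above comes from InetAddress.getLocalHost throwing when the VPC instance's hostname has no DNS entry. A minimal sketch of one possible fallback (illustrative only, not the submitted fix): catch the UnknownHostException and pick a non-loopback interface address instead.

{code}
import java.net.{Inet4Address, InetAddress, NetworkInterface, UnknownHostException}
import scala.collection.JavaConverters._

// Fallback when the local hostname does not resolve, as on the VPC instances described above.
def localIpWithFallback(): String =
  try {
    InetAddress.getLocalHost.getHostAddress
  } catch {
    case _: UnknownHostException =>
      NetworkInterface.getNetworkInterfaces.asScala
        .flatMap(_.getInetAddresses.asScala)
        .collectFirst { case a: Inet4Address if !a.isLoopbackAddress => a.getHostAddress }
        .getOrElse("127.0.0.1")
  }
{code}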
[jira] [Created] (SPARK-5264) support `drop table` DDL command
shengli created SPARK-5264: -- Summary: support `drop table` DDL command Key: SPARK-5264 URL: https://issues.apache.org/jira/browse/SPARK-5264 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 support `drop table` DDL command -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5263) `create table` DDL need to check if table exists first
[ https://issues.apache.org/jira/browse/SPARK-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278476#comment-14278476 ] Apache Spark commented on SPARK-5263: - User 'OopsOutOfMemory' has created a pull request for this issue: https://github.com/apache/spark/pull/4058 `create table` DDL need to check if table exists first --- Key: SPARK-5263 URL: https://issues.apache.org/jira/browse/SPARK-5263 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: shengli Priority: Minor Fix For: 1.3.0 Original Estimate: 72h Remaining Estimate: 72h `create table` DDL need to check if table exists first -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
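A hedged sketch of the requested behavior (the catalog here is a plain in-memory set standing in for Spark SQL's real catalog API): `create table` should consult the catalog first and fail fast with a clear error when the table is already registered.

{code}
import scala.collection.mutable

// Toy catalog used only to illustrate the check; not Spark SQL's actual catalog interface.
class SimpleCatalog {
  private val tables = mutable.Set.empty[String]

  def createTable(name: String, allowExisting: Boolean = false): Unit = {
    if (tables.contains(name) && !allowExisting)
      throw new RuntimeException(s"Table '$name' already exists")
    tables += name
  }
}

// val catalog = new SimpleCatalog()
// catalog.createTable("t")   // ok
// catalog.createTable("t")   // fails fast instead of surfacing a confusing error later
{code}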
[jira] [Commented] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster
[ https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278502#comment-14278502 ] Takumi Yoshida commented on SPARK-5243: --- Hi! I found that Spark hangs in the following situation; I guess there may be some other condition. 1. the cluster has only one worker -- yes, running standalone. 2. driver memory + executor memory > worker memory -- I used the following settings, but it hangs. driver memory = 1g executor memory = 1g worker memory = 3g 3. deploy-mode = cluster -- no, deploy-mode was client by default. I used the following code. https://gist.github.com/yoshi0309/33bd912d91c0bb5cdf30 Command: ./bin/spark-submit ./ldgourmetALS.py s3n://abc-takumiyoshida/datasets/ --driver-memory 1g Machine: Amazon EC2 / m3.medium (3ECU and 3.75GB RAM) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster Key: SPARK-5243 URL: https://issues.apache.org/jira/browse/SPARK-5243 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.2.0 Environment: centos, others should be similar Reporter: yuhao yang Priority: Minor Spark will hang if spark-submit is called under the following conditions: 1. the cluster has only one worker. 2. driver memory + executor memory > worker memory 3. deploy-mode = cluster This usually happens during development for beginners. There should be some exit mechanism or at least a warning message in the output of spark-submit. I am preparing a PR for the case. And I would like to know your opinions on whether a fix is needed and what the better fix options are. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
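A sketch of the kind of fail-fast check the reporter proposes adding around spark-submit (names and units are illustrative, not Spark's actual deploy code): when the requested driver plus executor memory cannot fit on the lone worker, print a warning or exit instead of letting the application wait forever for resources.

{code}
// Illustrative check only; the real fix would live in Spark's deploy/submit path.
def checkSingleWorkerFit(driverMemMb: Int, executorMemMb: Int, workerMemMb: Int): Unit = {
  val requested = driverMemMb + executorMemMb
  if (requested > workerMemMb) {
    System.err.println(
      s"Requested $requested MB (driver + executor) but the only worker offers $workerMemMb MB; " +
        "in cluster deploy mode the application would hang waiting for resources.")
    sys.exit(1)
  }
}

// checkSingleWorkerFit(driverMemMb = 2048, executorMemMb = 2048, workerMemMb = 3072)  // warns and exits
{code}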
[jira] [Updated] (SPARK-1727) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs
[ https://issues.apache.org/jira/browse/SPARK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1727: -- Assignee: Sean Owen (was: Sean Owen) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs - Key: SPARK-1727 URL: https://issues.apache.org/jira/browse/SPARK-1727 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 0.9.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.0.0 While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure
[ https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1789: -- Assignee: Sean Owen (was: Sean Owen) Multiple versions of Netty dependencies cause FlumeStreamSuite failure -- Key: SPARK-1789 URL: https://issues.apache.org/jira/browse/SPARK-1789 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.9.1 Reporter: Sean Owen Assignee: Sean Owen Labels: flume, netty, test Fix For: 1.0.0 TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure. I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamingSuite, and have for a short while (is it just me?) velvia notes: I have found a workaround. If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty. There are at least 3 versions of Netty in play in the build: - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6. - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue. The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final. But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile. If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation. So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict: - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent - Update SBT build accordingly A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive
[ https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1802: -- Assignee: Sean Owen (was: Sean Owen) Audit dependency graph when Spark is built with -Phive -- Key: SPARK-1802 URL: https://issues.apache.org/jira/browse/SPARK-1802 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sean Owen Priority: Blocker Fix For: 1.0.0 Attachments: hive-exec-jar-problems.txt I'd like to have binary release for 1.0 include Hive support. Since this isn't enabled by default in the build I don't think it's as well tested, so we should dig around a bit and decide if we need to e.g. add any excludes. {code} $ mvn install -Phive -DskipTests mvn dependency:build-classpath -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort without_hive.txt $ mvn install -Phive -DskipTests mvn dependency:build-classpath -Phive -pl assembly | grep -v INFO | tr : \n | awk ' { FS=/; print ( $(NF) ); }' | sort with_hive.txt $ diff without_hive.txt with_hive.txt antlr-2.7.7.jar antlr-3.4.jar antlr-runtime-3.4.jar 10,14d6 avro-1.7.4.jar avro-ipc-1.7.4.jar avro-ipc-1.7.4-tests.jar avro-mapred-1.7.4.jar bonecp-0.7.1.RELEASE.jar 22d13 commons-cli-1.2.jar 25d15 commons-compress-1.4.1.jar 33,34d22 commons-logging-1.1.1.jar commons-logging-api-1.0.4.jar 38d25 commons-pool-1.5.4.jar 46,49d32 datanucleus-api-jdo-3.2.1.jar datanucleus-core-3.2.2.jar datanucleus-rdbms-3.2.1.jar derby-10.4.2.0.jar 53,57d35 hive-common-0.12.0.jar hive-exec-0.12.0.jar hive-metastore-0.12.0.jar hive-serde-0.12.0.jar hive-shims-0.12.0.jar 60,61d37 httpclient-4.1.3.jar httpcore-4.1.3.jar 68d43 JavaEWAH-0.3.2.jar 73d47 javolution-5.5.1.jar 76d49 jdo-api-3.0.1.jar 78d50 jetty-6.1.26.jar 87d58 jetty-util-6.1.26.jar 93d63 json-20090211.jar 98d67 jta-1.1.jar 103,104d71 libfb303-0.9.0.jar libthrift-0.9.0.jar 112d78 mockito-all-1.8.5.jar 136d101 servlet-api-2.5-20081211.jar 139d103 snappy-0.2.jar 144d107 spark-hive_2.10-1.0.0.jar 151d113 ST4-4.0.4.jar 153d114 stringtemplate-3.2.1.jar 156d116 velocity-1.7.jar 158d117 xz-1.0.jar {code} Some initial investigation suggests we may need to take some precaution surrounding (a) jetty and (b) servlet-api. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1248) Spark build error with Apache Hadoop(Cloudera CDH4)
[ https://issues.apache.org/jira/browse/SPARK-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1248: -- Assignee: Sean Owen (was: Sean Owen) Spark build error with Apache Hadoop(Cloudera CDH4) --- Key: SPARK-1248 URL: https://issues.apache.org/jira/browse/SPARK-1248 Project: Spark Issue Type: Bug Components: Build Reporter: Guoqiang Li Assignee: Sean Owen Fix For: 1.0.0 {code} SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true sbt/sbt assembly -d error.log {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1120) Send all dependency logging through slf4j
[ https://issues.apache.org/jira/browse/SPARK-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1120: -- Assignee: Sean Owen (was: Sean Owen) Send all dependency logging through slf4j - Key: SPARK-1120 URL: https://issues.apache.org/jira/browse/SPARK-1120 Project: Spark Issue Type: Improvement Reporter: Patrick Cogan Assignee: Sean Owen Fix For: 1.0.0 There are a few dependencies that pull in other logging frameworks which don't get routed correctly. We should include the relevant slf4j adapters and exclude those logging libraries. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2363) Clean MLlib's sample data files
[ https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-2363: -- Assignee: Sean Owen (was: Sean Owen) Clean MLlib's sample data files --- Key: SPARK-2363 URL: https://issues.apache.org/jira/browse/SPARK-2363 Project: Spark Issue Type: Task Components: MLlib Reporter: Xiangrui Meng Assignee: Sean Owen Priority: Minor Fix For: 1.1.0 MLlib has sample data under several folders: 1) data/mllib 2) data/ 3) mllib/data/* Per previous discussion with [~matei], we want to put them under `data/mllib` and clean outdated files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds
[ https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tony Stevenson updated SPARK-1254: -- Assignee: Sean Owen (was: Sean Owen) Consolidate, order, and harmonize repository declarations in Maven/SBT builds - Key: SPARK-1254 URL: https://issues.apache.org/jira/browse/SPARK-1254 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 0.9.0 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.0.0 This suggestion addresses a few minor suboptimalities with how repositories are handled. 1) Use HTTPS consistently to access repos, instead of HTTP 2) Consolidate repository declarations in the parent POM file, in the case of the Maven build, so that their ordering can be controlled to put the fully optional Cloudera repo at the end, after required repos. (This was prompted by the untimely failure of the Cloudera repo this week, which made the Spark build fail. #2 would have prevented that.) 3) Update SBT build to match Maven build in this regard 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in Maven, and a build generally would not refer to external snapshots, but I'm not 100% sure on this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org