[jira] [Updated] (SPARK-5268) CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nan Zhu updated SPARK-5268:
---
Summary: CoarseGrainedExecutorBackend exits for irrelevant 
DisassociatedEvent  (was: ExecutorBackend exits for irrelevant 
DisassociatedEvent)

 CoarseGrainedExecutorBackend exits for irrelevant DisassociatedEvent
 

 Key: SPARK-5268
 URL: https://issues.apache.org/jira/browse/SPARK-5268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu

 In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
 executor backend actor and exit the program upon receiving such an event...
 Let's consider the following case:
 The user may develop an Akka-based program which starts an actor within 
 Spark's actor system and communicates with an external actor system (e.g. an 
 Akka-based receiver in Spark Streaming which communicates with an external 
 system). If the external actor system fails or deliberately disassociates from 
 the actor within Spark's actor system, we may receive a DisassociatedEvent and 
 the executor is restarted.
 This is not the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278786#comment-14278786
 ] 

Apache Spark commented on SPARK-5268:
-

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/4063

 ExecutorBackend exits for irrelevant DisassociatedEvent
 ---

 Key: SPARK-5268
 URL: https://issues.apache.org/jira/browse/SPARK-5268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu

 In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
 executor backend actor and exit the program upon receiving such an event...
 Let's consider the following case:
 The user may develop an Akka-based program which starts an actor within 
 Spark's actor system and communicates with an external actor system (e.g. an 
 Akka-based receiver in Spark Streaming which communicates with an external 
 system). If the external actor system fails or deliberately disassociates from 
 the actor within Spark's actor system, we may receive a DisassociatedEvent and 
 the executor is restarted.
 This is not the expected behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Travis Galoppo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278947#comment-14278947
 ] 

Travis Galoppo commented on SPARK-5012:
---

This will probably be affected by SPARK-5019


 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

 Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5185) pyspark --jars does not add classes to driver class path

2015-01-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279040#comment-14279040
 ] 

Marcelo Vanzin commented on SPARK-5185:
---

BTW I talked to Uri offline about this. The cause is that {{sc._jvm.blah}} 
seems to use the system class loader to load blah, and {{--jars}} adds things 
to the application class loader instantiated by SparkSubmit. e.g., this works:

{code}
sc._jvm.java.lang.Thread.currentThread().getContextClassLoader().loadClass("com.cloudera.science.throwaway.ThrowAway").newInstance()
{code}

That being said, I'm not sure what the expectation is here. {{_jvm}}, starting 
with an underscore, gives me the impression that it's not really supposed to be 
a public API.

 pyspark --jars does not add classes to driver class path
 

 Key: SPARK-5185
 URL: https://issues.apache.org/jira/browse/SPARK-5185
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Uri Laserson

 I have some random class I want access to from a Spark shell, say 
 {{com.cloudera.science.throwaway.ThrowAway}}.  You can find the specific 
 example I used here:
 https://gist.github.com/laserson/e9e3bd265e1c7a896652
 I packaged it as {{throwaway.jar}}.
 If I then run {{bin/spark-shell}} like so:
 {code}
 bin/spark-shell --master local[1] --jars throwaway.jar
 {code}
 I can execute
 {code}
 val a = new com.cloudera.science.throwaway.ThrowAway()
 {code}
 Successfully.
 I now run PySpark like so:
 {code}
 PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
 throwaway.jar
 {code}
 which gives me an error when I try to instantiate the class through Py4J:
 {code}
 In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
 ---
 Py4JError Traceback (most recent call last)
 <ipython-input-1-4eedbe023c29> in <module>()
 ----> 1 sc._jvm.com.cloudera.science.throwaway.ThrowAway()
 /Users/laserson/repos/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __getattr__(self, name)
 724 def __getattr__(self, name):
 725 if name == '__call__':
 --> 726 raise Py4JError('Trying to call a package.')
 727 new_fqn = self._fqn + '.' + name
 728 command = REFLECTION_COMMAND_NAME +\
 Py4JError: Trying to call a package.
 {code}
 However, if I explicitly use {{--driver-class-path}} to add the same jar
 {code}
 PYSPARK_DRIVER_PYTHON=ipython bin/pyspark --master local[1] --jars 
 throwaway.jar --driver-class-path throwaway.jar
 {code}
 it works
 {code}
 In [1]: sc._jvm.com.cloudera.science.throwaway.ThrowAway()
 Out[1]: JavaObject id=o18
 {code}
 However, the docs state that {{--jars}} should also set the driver class path.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279078#comment-14279078
 ] 

Reynold Xin commented on SPARK-5097:


[~hkothari] that is correct. It will be trivially doable to select columns at 
runtime.

For the 2nd one, not yet. That's a very good point. You can always do an extra 
projection. We will try to add it, if not in the 1st iteration, then in the 2nd 
iteration.

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278983#comment-14278983
 ] 

Al M commented on SPARK-5270:
-

I just noticed that rdd.partitions.size is set to 0 for empty RDDs and > 0 for 
RDDs with data; this is a far more elegant check than the others.

 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278993#comment-14278993
 ] 

Sean Owen commented on SPARK-5270:
--

I think it's conceivable to have an RDD with no elements but nonzero partitions 
though. Witness:

{code}
val empty = sc.parallelize(Array[Int]())
empty.count
...
0
empty.partitions.size
...
8
{code}
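
A minimal sketch of what an isEmpty check could look like, combining the two 
observations above (this is an illustration, not Spark's actual implementation):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical helper: an RDD can have partitions yet hold zero elements,
// so fall back to peeking at the first element instead of trusting
// partitions.length alone.
def isEmpty[T](rdd: RDD[T]): Boolean =
  rdd.partitions.length == 0 || rdd.take(1).isEmpty
{code}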

 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)
Al M created SPARK-5270:
---

 Summary: Elegantly check if RDD is empty
 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial


Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes I get nothing for hours.  Still I have to run count() 
to check if there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from configured endpoints

2015-01-15 Thread Steve Brewin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Brewin updated SPARK-5267:

Description: 
The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which supports many additional input protocols. Our tried and 
tested implementation of this proposal is spark-streaming-camel. 

An Apache Camel service is run on a separate Thread, consuming each 
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.

Thoughts?




  was:
The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which support many more input protocols. Our tried and tested 
implementation of this proposal is spark-streaming-camel. 

An Apache Camel service is run on a separate Thread, consuming each 
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.

Thoughts?





 Add a streaming module to ingest Apache Camel Messages from configured 
 endpoints
 --

 Key: SPARK-5267
 URL: https://issues.apache.org/jira/browse/SPARK-5267
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Steve Brewin
  Labels: features
   Original Estimate: 120h
  Remaining Estimate: 120h

 The number of input stream protocols supported by Spark Streaming is quite 
 limited, which constrains the number of systems with which it can be 
 integrated.
 This proposal solves the problem by adding an optional module that integrates 
 Apache Camel, which supports many additional input protocols. Our tried and 
 tested implementation of this proposal is spark-streaming-camel. 
 An Apache Camel service is run on a separate Thread, consuming each 
 http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
  and storing it into Spark's memory. The provider of the Message is specified 
 by any consuming component URI documented at 
 http://camel.apache.org/components.html, making all of these protocols 
 available to Spark Streaming.
 Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5271) PySpark History Web UI issues

2015-01-15 Thread Andrey Zimovnov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrey Zimovnov updated SPARK-5271:
---
Component/s: Web UI

 PySpark History Web UI issues
 -

 Key: SPARK-5271
 URL: https://issues.apache.org/jira/browse/SPARK-5271
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.2.0
 Environment: PySpark 1.2.0 in yarn-client mode
Reporter: Andrey Zimovnov

 After a successful run of a PySpark app via spark-submit in yarn-client mode 
 on a Hadoop 2.4 cluster, the History UI shows the same issue as described in 
 SPARK-3898.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5271) PySpark History Web UI issues

2015-01-15 Thread Andrey Zimovnov (JIRA)
Andrey Zimovnov created SPARK-5271:
--

 Summary: PySpark History Web UI issues
 Key: SPARK-5271
 URL: https://issues.apache.org/jira/browse/SPARK-5271
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: PySpark 1.2.0 in yarn-client mode
Reporter: Andrey Zimovnov


After a successful run of a PySpark app via spark-submit in yarn-client mode on 
a Hadoop 2.4 cluster, the History UI shows the same issue as described in 
SPARK-3898.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5270) Elegantly check if RDD is empty

2015-01-15 Thread Al M (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Al M updated SPARK-5270:

Description: 
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

I'd like a method rdd.isEmpty that returns a boolean.

This would be especially useful when using streams.  Sometimes my batches are 
huge in one stream, sometimes I get nothing for hours.  Still I have to run 
count() to check if there is anything in the RDD.  I can process my empty RDD 
like the others but it would be more efficient to just skip the empty ones.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.



  was:
Right now there is no clean way to check if an RDD is empty.  As discussed 
here: 
http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679

This is especially a problem when using streams.  Sometimes my batches are huge 
in one stream, sometimes I get nothing for hours.  Still I have to run count() 
to check if there is anything in the RDD.

I can also run first() and catch the exception; this is neither a clean nor 
fast solution.

I'd like a method rdd.isEmpty that returns a boolean.


 Elegantly check if RDD is empty
 ---

 Key: SPARK-5270
 URL: https://issues.apache.org/jira/browse/SPARK-5270
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.2.0
 Environment: Centos 6
Reporter: Al M
Priority: Trivial

 Right now there is no clean way to check if an RDD is empty.  As discussed 
 here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Testing-if-an-RDD-is-empty-td1678.html#a1679
 I'd like a method rdd.isEmpty that returns a boolean.
 This would be especially useful when using streams.  Sometimes my batches are 
 huge in one stream, sometimes I get nothing for hours.  Still I have to run 
 count() to check if there is anything in the RDD.  I can process my empty RDD 
 like the others but it would be more efficient to just skip the empty ones.
 I can also run first() and catch the exception; this is neither a clean nor 
 fast solution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278780#comment-14278780
 ] 

Vladimir Grigor commented on SPARK-5246:


https://github.com/mesos/spark-ec2/pull/91

 spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does 
 not resolve
 --

 Key: SPARK-5246
 URL: https://issues.apache.org/jira/browse/SPARK-5246
 Project: Spark
  Issue Type: Bug
  Components: EC2
Reporter: Vladimir Grigor

 How to reproduce: 
 1) Following http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
 should be sufficient to set up a VPC for this bug. After you have followed that 
 guide, start a new instance in the VPC and ssh to it (through the NAT server).
 2) The user starts a cluster in the VPC:
 {code}
 ./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
 --spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
 --subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
 Setting up security groups...
 
 (omitted for brevity)
 10.1.1.62
 10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
 no org.apache.spark.deploy.master.Master to stop
 starting org.apache.spark.deploy.master.Master, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 failed to launch org.apache.spark.deploy.master.Master:
   at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
   ... 12 more
 full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
 10.1.1.62:at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 10.1.1.62:... 12 more
 10.1.1.62: full log in 
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
 [timing] spark-standalone setup:  00h 00m 28s
  
 (omitted for brevity)
 {code}
 /root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
 {code}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
 :::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
  -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
 org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 
 --webui-port 8080
 
 15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
 HUP, INT]
 Exception in thread "main" java.net.UnknownHostException: ip-10-1-1-151: 
 ip-10-1-1-151: Name or service not known
 at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
 at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
 at 
 org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
 at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at 
 org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
 at 
 org.apache.spark.deploy.master.MasterArguments.<init>(MasterArguments.scala:27)
 at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
 at org.apache.spark.deploy.master.Master.main(Master.scala)
 Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
 known
 at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
 at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
 at 
 java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
 at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
 ... 12 more
 {code}
 The problem is that an instance launched in a VPC may not be able to resolve 
 its own local hostname. Please see 
 https://forums.aws.amazon.com/thread.jspa?threadID=92092.
 I am going to submit a fix for this problem since I need this functionality 
 asap.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To 

[jira] [Created] (SPARK-5268) ExecutorBackend exits for irrelevant DisassociatedEvent

2015-01-15 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-5268:
--

 Summary: ExecutorBackend exits for irrelevant DisassociatedEvent
 Key: SPARK-5268
 URL: https://issues.apache.org/jira/browse/SPARK-5268
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Nan Zhu


In CoarseGrainedExecutorBackend, we subscribe to DisassociatedEvent in the 
executor backend actor and exit the program upon receiving such an event...

Let's consider the following case:

The user may develop an Akka-based program which starts an actor within Spark's 
actor system and communicates with an external actor system (e.g. an Akka-based 
receiver in Spark Streaming which communicates with an external system). If the 
external actor system fails or deliberately disassociates from the actor within 
Spark's actor system, we may receive a DisassociatedEvent and the executor is 
restarted.

This is not the expected behavior.
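
For illustration, a rough sketch of the kind of guard being discussed (this is 
not the actual CoarseGrainedExecutorBackend code; the {{driverAddress}} field is 
an assumption):

{code}
import akka.actor.{Actor, Address}
import akka.remote.DisassociatedEvent

// Hypothetical sketch: treat a DisassociatedEvent as fatal only when it refers
// to the driver's address; ignore disassociations from unrelated actor systems.
class GuardedBackend(driverAddress: Address) extends Actor {
  def receive = {
    case DisassociatedEvent(_, remoteAddress, _) if remoteAddress == driverAddress =>
      // The driver is gone, so shutting down the executor is the right response.
      context.system.shutdown()
    case _: DisassociatedEvent =>
      // Some other remote actor system disconnected; this is not a reason to exit.
  }
}
{code}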



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5267) Add a streaming module to ingest Apache Camel Messages from configured endpoints

2015-01-15 Thread Steve Brewin (JIRA)
Steve Brewin created SPARK-5267:
---

 Summary: Add a streaming module to ingest Apache Camel Messages 
from configured endpoints
 Key: SPARK-5267
 URL: https://issues.apache.org/jira/browse/SPARK-5267
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Steve Brewin


The number of input stream protocols supported by Spark Streaming is quite 
limited, which constrains the number of systems with which it can be integrated.

This proposal solves the problem by adding an optional module that integrates 
Apache Camel, which supports many more input protocols. Our tried and tested 
implementation of this proposal is spark-streaming-camel. 

An Apache Camel service is run on a separate Thread, consuming each 
http://camel.apache.org/maven/current/camel-core/apidocs/org/apache/camel/Message.html
 and storing it into Spark's memory. The provider of the Message is specified 
by any consuming component URI documented at 
http://camel.apache.org/components.html, making all of these protocols 
available to Spark Streaming.

Thoughts?
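
For illustration only, a rough sketch of how such a module might wire Camel into 
Spark Streaming via a custom receiver (the class name and endpoint URI below are 
hypothetical; the actual spark-streaming-camel implementation may differ):

{code}
import org.apache.camel.{Exchange, Processor}
import org.apache.camel.builder.RouteBuilder
import org.apache.camel.impl.DefaultCamelContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical receiver: consumes from any Camel component URI and stores
// each message body into Spark's memory via store().
class CamelStringReceiver(endpointUri: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  @transient private var camel: DefaultCamelContext = _

  def onStart(): Unit = {
    camel = new DefaultCamelContext()
    camel.addRoutes(new RouteBuilder {
      def configure(): Unit = {
        from(endpointUri).process(new Processor {
          def process(exchange: Exchange): Unit =
            store(exchange.getIn.getBody(classOf[String]))
        })
      }
    })
    camel.start()  // Camel runs its consumers on its own threads
  }

  def onStop(): Unit = if (camel != null) camel.stop()
}

// Hypothetical usage:
// ssc.receiverStream(new CamelStringReceiver("jetty:http://0.0.0.0:8888/ingest"))
{code}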






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5269) BlockManager.dataDeserialize always creates a new serializer instance

2015-01-15 Thread Ivan Vergiliev (JIRA)
Ivan Vergiliev created SPARK-5269:
-

 Summary: BlockManager.dataDeserialize always creates a new 
serializer instance
 Key: SPARK-5269
 URL: https://issues.apache.org/jira/browse/SPARK-5269
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Ivan Vergiliev


BlockManager.dataDeserialize always creates a new instance of the serializer, 
which is pretty slow in some cases. I'm using Kryo serialization and have a 
custom registrator, and its register method is showing up as taking about 15% 
of the execution time in my profiles. This started happening after I increased 
the number of keys in a job with a shuffle phase by a factor of 40.

One solution I can think of is to create a ThreadLocal SerializerInstance for 
the defaultSerializer, and only create a new one if a custom serializer is 
passed in. AFAICT a custom serializer is passed only from DiskStore.getValues, 
and that, on the other hand, depends on the serializer passed to 
ExternalSorter. I don't know how often this is used, but I think this can still 
be a good solution for the standard use case.
Oh, and also - ExternalSorter already has a SerializerInstance, so if the 
getValues method is called from a single thread, maybe we can pass that 
directly?

I'd be happy to try a patch but would probably need a confirmation from someone 
that this approach would indeed work (or an idea for another).
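
A minimal sketch of the ThreadLocal idea, assuming Spark's Serializer / 
SerializerInstance types (this is a hypothetical helper, not the actual 
BlockManager code):

{code}
import org.apache.spark.serializer.{Serializer, SerializerInstance}

// Hypothetical helper: cache one SerializerInstance per thread for the default
// serializer, and only instantiate a fresh one when a custom serializer is
// passed in (e.g. from DiskStore.getValues).
class CachedSerializerInstances(defaultSerializer: Serializer) {
  private val cached = new ThreadLocal[SerializerInstance] {
    override def initialValue(): SerializerInstance = defaultSerializer.newInstance()
  }

  def instanceFor(custom: Option[Serializer]): SerializerInstance =
    custom.map(_.newInstance()).getOrElse(cached.get())
}
{code}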



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5097) Adding data frame APIs to SchemaRDD

2015-01-15 Thread Hamel Ajay Kothari (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278819#comment-14278819
 ] 

Hamel Ajay Kothari commented on SPARK-5097:
---

Am I correct in interpreting that this would allow us to trivially select 
columns at runtime since we'd just use {{SchemaRDD(stringColumnName)}}? In the 
world of Catalyst, selecting columns known only at runtime was a real pain 
because the only documented way to do it was to use quasiquotes or 
{{SchemaRDD.baseLogicalPlan.resolve()}}. The first couldn't be defined at 
runtime (as far as I know) and the second required you to depend on expressions.

Also, is there any way to control the name of the resulting columns from 
groupby+aggregate (or similar methods that add columns) in this plan?

 Adding data frame APIs to SchemaRDD
 ---

 Key: SPARK-5097
 URL: https://issues.apache.org/jira/browse/SPARK-5097
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
 Attachments: DesignDocAddingDataFrameAPIstoSchemaRDD.pdf


 SchemaRDD, through its DSL, already provides common data frame 
 functionalities. However, the DSL was originally created for constructing 
 test cases without much end-user usability and API stability consideration. 
 This design doc proposes a set of API changes for Scala and Python to make 
 the SchemaRDD DSL API more usable and stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-01-15 Thread Muhammad-Ali A'rabi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279170#comment-14279170
 ] 

Muhammad-Ali A'rabi edited comment on SPARK-5226 at 1/15/15 7:33 PM:
-

This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to implement 
it. The first is faster (O(n log n)) but requires more memory (O(n^2)). The 
other is slower (O(n^2)) but requires less memory (O(n)). I prefer the first 
one, as we are not short on memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all points is created and 
the distance between every two points is calculated. This is highly parallel 
(see the sketch below).
* Main process. In this phase the algorithm runs as described in the 
pseudo-code, and the two foreach loops are parallelized. Region queries are 
done very fast (O(1)) because of the preprocessing.
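
As a hedged illustration of the preprocessing phase on Spark (not part of any 
existing MLlib API; the point encoding and the Euclidean distance are 
assumptions):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: compute all pairwise distances in parallel. Each point
// is (id, coordinates); regionQuery(P, eps) then reduces to filtering the rows
// of this RDD with distance <= eps.
def pairwiseDistances(points: RDD[(Long, Array[Double])]): RDD[((Long, Long), Double)] =
  points.cartesian(points).map { case ((i, p), (j, q)) =>
    val d = math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)
    ((i, j), d)
  }
{code}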


was (Author: angellandros):
This is DBSCAN algorithm:

{noformat}
DBSCAN(D, eps, MinPts)
   C = 0
   for each unvisited point P in dataset D
  mark P as visited
  NeighborPts = regionQuery(P, eps)
  if sizeof(NeighborPts) < MinPts
 mark P as NOISE
  else
 C = next cluster
 expandCluster(P, NeighborPts, C, eps, MinPts)
  
expandCluster(P, NeighborPts, C, eps, MinPts)
   add P to cluster C
   for each point P' in NeighborPts 
  if P' is not visited
 mark P' as visited
 NeighborPts' = regionQuery(P', eps)
 if sizeof(NeighborPts') >= MinPts
NeighborPts = NeighborPts joined with NeighborPts'
  if P' is not yet member of any cluster
 add P' to cluster C
  
regionQuery(P, eps)
   return all points within P's eps-neighborhood (including P)
{noformat}

As you can see, there are just two parameters. There are two ways to implement 
it. The first is faster (O(n log n)) but requires more memory (O(n^2)). The 
other is slower (O(n^2)) but requires less memory (O(n)). I prefer the first 
one, as we are not short on memory.
There are two phases of running:
* Preprocessing. In this phase a distance matrix for all points is created and 
the distance between every two points is calculated. This is highly parallel.
* Main process. In this phase the algorithm runs as described in the 
pseudo-code, and the two foreach loops are parallelized. Region queries are 
done very fast (O(1)) because of the preprocessing.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib is all k-means now, and I think we should add some new clustering 
 algorithms to it. My first candidate is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5273) Improve documentation examples for LinearRegression

2015-01-15 Thread Dev Lakhani (JIRA)
Dev Lakhani created SPARK-5273:
--

 Summary: Improve documentation examples for LinearRegression 
 Key: SPARK-5273
 URL: https://issues.apache.org/jira/browse/SPARK-5273
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Dev Lakhani
Priority: Minor


In the document:
https://spark.apache.org/docs/1.1.1/mllib-linear-methods.html

Under
Linear least squares, Lasso, and ridge regression

The suggested way to use LinearRegressionWithSGD.train(),
// Building the model
val numIterations = 100
val model = LinearRegressionWithSGD.train(parsedData, numIterations)

is not ideal even for simple examples such as y=x. It should be replaced with 
more realistic parameters that include a step size:

val lr = new LinearRegressionWithSGD()
lr.optimizer.setStepSize(0.0001)
lr.optimizer.setNumIterations(100)

or

LinearRegressionWithSGD.train(input, 100, 0.0001)

to obtain a reasonable MSE. It took me a while on the dev forum to learn that 
the step size should be really small. This might help save someone the same 
effort when learning MLlib.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279251#comment-14279251
 ] 

Joseph K. Bradley commented on SPARK-5012:
--

[~MeethuMathew], [~tgaloppo] makes a good point.  It might actually be best to 
make a Python API for MultivariateGaussian first, and then to do this JIRA.  
(Since we don't want to require scipy currently, we can't use the existing 
scipy.stats.multivariate_normal class.)

 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

 Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279352#comment-14279352
 ] 

Apache Spark commented on SPARK-5274:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4056

 Stabilize UDFRegistration API
 -

 Key: SPARK-5274
 URL: https://issues.apache.org/jira/browse/SPARK-5274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
 (udf). This removes 45 methods from SQLContext.
 2. For Java UDFs, renamed dataType to returnType.
 3. For Scala UDFs, added type tags.
 4. Added all Java UDF registration methods to Scala's UDFRegistration.
 5. Better documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5274:
--

 Summary: Stabilize UDFRegistration API
 Key: SPARK-5274
 URL: https://issues.apache.org/jira/browse/SPARK-5274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
(udf). This removes 45 methods from SQLContext.
2. For Java UDFs, renamed dataType to returnType.
3. For Scala UDFs, added type tags.
4. Added all Java UDF registration methods to Scala's UDFRegistration.
5. Better documentation
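
For illustration, a short sketch of registration through the new {{udf}} field 
(the {{strLen}} UDF and the {{people}} table are made-up examples):

{code}
import org.apache.spark.sql.SQLContext

// Hypothetical usage: UDFs are registered via the sqlContext.udf field
// instead of methods mixed directly into SQLContext.
val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
sqlContext.udf.register("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(name) FROM people")
{code}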




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279235#comment-14279235
 ] 

Joseph K. Bradley edited comment on SPARK-5272 at 1/15/15 8:13 PM:
---

My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes.  NIPS 2002.
** Theoretically, the 2 types of models have the same purpose, but they should 
be used in different regimes.

In terms of when NB is actually used by Spark users, I'm not sure.  Hopefully 
some research and discussion here will make that clearer.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which 
use the same underlying implementation.  That implementation should include a 
Factor concept encoding the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.
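
To make the Factor idea concrete, a rough sketch (the names below are 
hypothetical, not an existing MLlib API):

{code}
// Hypothetical sketch of a Factor abstraction: each feature (and the label)
// carries a distribution type, and the estimator dispatches on it instead of
// hard-coding multinomial counts.
sealed trait Factor {
  def logLikelihood(value: Double): Double
}

case class BernoulliFactor(p: Double) extends Factor {
  def logLikelihood(value: Double): Double =
    if (value > 0.5) math.log(p) else math.log(1.0 - p)
}

case class GaussianFactor(mu: Double, sigma2: Double) extends Factor {
  def logLikelihood(value: Double): Double =
    -0.5 * (math.log(2.0 * math.Pi * sigma2) + (value - mu) * (value - mu) / sigma2)
}
{code}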


was (Author: josephkb):
My initial thoughts:

(1) Are continuous labels/features important to support?

In terms of when NB *should* be used, I believe they are important.  People use 
Logistic Regression with continuous labels and features, and Naive Bayes is 
really the same type of model (just trained differently).
* E.g.: Ng & Jordan. On Discriminative vs. Generative classifiers: A 
comparison of logistic regression and naive Bayes.  NIPS 2002.
** Theoretically, the 2 types of models have the same purpose, but they should 
be used in different regimes.

(2) What should the API look like?

I believe there should be a NaiveBayesClassifier and NaiveBayesRegressor which 
use the same underlying implementation.  That implementation should include a 
Factor concept encoding the type of distribution.

This should be simple to do for Naive Bayes, and it will give some guidance if 
we move to support more general probabilistic graphical models in MLlib.

 Refactor NaiveBayes to support discrete and continuous labels,features
 --

 Key: SPARK-5272
 URL: https://issues.apache.org/jira/browse/SPARK-5272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 This JIRA is to discuss refactoring NaiveBayes in order to support both 
 discrete and continuous labels and features.
 Currently, NaiveBayes supports only discrete labels and features.
 Proposal: Generalize it to support continuous values as well.
 Some items to discuss are:
 * How commonly are continuous labels/features used in practice?  (Is this 
 necessary?)
 * What should the API look like?
 ** E.g., should NB have multiple classes for each type of label/feature, or 
 should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279250#comment-14279250
 ] 

RJ Nowling commented on SPARK-4894:
---

Thanks, [~josephkb]!  I'd be happy to help with the NB refactoring too :) 

 Add Bernoulli-variant of Naive Bayes
 

 Key: SPARK-4894
 URL: https://issues.apache.org/jira/browse/SPARK-4894
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: RJ Nowling
Assignee: RJ Nowling

 MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
 version of Naive Bayes is more useful for situations where the features are 
 binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279258#comment-14279258
 ] 

RJ Nowling commented on SPARK-5272:
---

Hi [~josephkb], 

I can see benefits to your suggestions of feature types (e.g., categorical, 
discrete counts, continuous, binary, etc.).  If we created corresponding 
FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it 
would promote composition which would be easier to test, debug, and maintain 
versus multiple NB subclasses like sklearn.  Additionally, if the user can 
define a type for each feature, then users can mix and match likelihood types 
as well.  Most NB implementations treat all features the same -- what if we had 
a model that allowed heterogeneous features?  If it works well in NB, it could 
be extended to other parts of MLlib.  (There is likely some overlap with 
decision trees since they support multiple feature types, so we might want to 
see if there is anything there we can reuse.)  At the API level, we could 
provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like 
the current API so that simplicity isn't compromised and provide a more 
advanced API for power users.

Does this sound like I'm understanding you correctly?

Re: Decision trees.  Decision tree models generally support different types of 
features (categorical, binary, discrete, continuous).  Does Spark's decision 
tree implementation support those different types?  How are they handled?  Do 
they abstract the feature type?  I feel there could be common ground here.


 Refactor NaiveBayes to support discrete and continuous labels,features
 --

 Key: SPARK-5272
 URL: https://issues.apache.org/jira/browse/SPARK-5272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 This JIRA is to discuss refactoring NaiveBayes in order to support both 
 discrete and continuous labels and features.
 Currently, NaiveBayes supports only discrete labels and features.
 Proposal: Generalize it to support continuous values as well.
 Some items to discuss are:
 * How commonly are continuous labels/features used in practice?  (Is this 
 necessary?)
 * What should the API look like?
 ** E.g., should NB have multiple classes for each type of label/feature, or 
 should it take a general Factor type parameter?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279274#comment-14279274
 ] 

Joseph K. Bradley edited comment on SPARK-1405 at 1/15/15 9:29 PM:
---

I'll try out the statmt dataset if that will be easier for everyone to access.

UPDATE: Note: The statmt dataset is an odd one since each document is a 
single sentence.  I'll still try it since I could imagine a lot of users 
wanting to run LDA on tweets or other short documents, but I might continue 
with my previous tests first.


was (Author: josephkb):
I'll try out the statmt dataset if that will be easier for everyone to access.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5224.
---
   Resolution: Fixed
Fix Version/s: 1.2.1
   1.3.0

Issue resolved by pull request 4024
[https://github.com/apache/spark/pull/4024]

 parallelize list/ndarray is really slow
 ---

 Key: SPARK-5224
 URL: https://issues.apache.org/jira/browse/SPARK-5224
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 After the default batchSize was changed to 0 (batching based on the size of 
 the objects), parallelize() still uses BatchedSerializer with batchSize=1.
 Also, BatchedSerializer did not work well with list and numpy.ndarray.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5224) parallelize list/ndarray is really slow

2015-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5224:
--
Assignee: Davies Liu

 parallelize list/ndarray is really slow
 ---

 Key: SPARK-5224
 URL: https://issues.apache.org/jira/browse/SPARK-5224
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker
 Fix For: 1.3.0, 1.2.1


 After the default batchSize was changed to 0 (batching based on the size of 
 the objects), parallelize() still uses BatchedSerializer with batchSize=1.
 Also, BatchedSerializer did not work well with list and numpy.ndarray.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5272) Refactor NaiveBayes to support discrete and continuous labels,features

2015-01-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5272:


 Summary: Refactor NaiveBayes to support discrete and continuous 
labels,features
 Key: SPARK-5272
 URL: https://issues.apache.org/jira/browse/SPARK-5272
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley


This JIRA is to discuss refactoring NaiveBayes in order to support both 
discrete and continuous labels and features.

Currently, NaiveBayes supports only discrete labels and features.

Proposal: Generalize it to support continuous values as well.

Some items to discuss are:
* How commonly are continuous labels/features used in practice?  (Is this 
necessary?)
* What should the API look like?
** E.g., should NB have multiple classes for each type of label/feature, or 
should it take a general Factor type parameter?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279241#comment-14279241
 ] 

Joseph K. Bradley commented on SPARK-4894:
--

[~rnowling]  I too don't want to hold up the Bernoulli NB too much.  I just 
made & linked a JIRA per your suggestion 
[https://issues.apache.org/jira/browse/SPARK-5272].  I'll add my thoughts there 
(and feel free to copy yours there too).

I'm not sure if we can reuse much from decision trees since they are not 
probabilistic models and have a different concept of loss or error.

For now, generalizing the existing Naive Bayes class to handle the Bernoulli 
case sounds good.  Thanks for taking the time to discuss this!

 Add Bernoulli-variant of Naive Bayes
 

 Key: SPARK-4894
 URL: https://issues.apache.org/jira/browse/SPARK-4894
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: RJ Nowling
Assignee: RJ Nowling

 MLlib only supports the multinomial-variant of Naive Bayes.  The Bernoulli 
 version of Naive Bayes is more useful for situations where the features are 
 binary values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279216#comment-14279216
 ] 

Apache Spark commented on SPARK-5111:
-

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/4064

 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
Reporter: Zhan Zhang

 Due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some 
 hive-0.14 fixes into Spark, since there is no effort to upgrade Hive to 0.14 
 support in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests

2015-01-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279524#comment-14279524
 ] 

Imran Rashid commented on SPARK-4746:
-

This doesn't work as well as I thought -- all of the junit tests get skipped.  
The problem is a mismatch between the way test args are handled by the junit 
test runner and the scalatest runner.

I think our options are:

1) abandon a tag-based approach: just use directories / file names to separate 
out unit tests & integration tests

2) change all of our junit tests to scalatest.  (it's perfectly fine to test 
java code w/ scalatest.)

3) See if we can get scalatest to also run our junit tests

4) change the sbt task to first run scalatest, with all junit tests turned off, 
and then just run the junit tests, so that we can pass in different args to 
each one.

5) just live w/ the fact that the junit tests never match the tags so they are 
effectively considered integration tests.

Note that junit has a notion similar to tags in categories: 
https://github.com/junit-team/junit/wiki/Categories
The main problem here is the difference in the args for the two test runners.
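
For reference, a small sketch of what the scalatest side of the tag-based 
approach could look like (the tag name and suite below are made up):

{code}
import org.scalatest.{FunSuite, Tag}

// Hypothetical tag; the real name/package would be decided in the PR.
object IntegrationTest extends Tag("org.apache.spark.tags.IntegrationTest")

class ExampleSuite extends FunSuite {
  // Excluded when scalatest is run with "-l org.apache.spark.tags.IntegrationTest".
  test("slow end-to-end check", IntegrationTest) {
    // ... long-running assertions ...
  }

  test("fast unit check") {
    assert(1 + 1 === 2)
  }
}
{code}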

 integration tests should be separated from faster unit tests
 

 Key: SPARK-4746
 URL: https://issues.apache.org/jira/browse/SPARK-4746
 Project: Spark
  Issue Type: Bug
Reporter: Imran Rashid
Priority: Trivial

 Currently there isn't a good way for a developer to skip the longer 
 integration tests.  This can slow down local development.  See 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
 One option is to use scalatest's notion of test tags to tag all integration 
 tests, so they could easily be skipped



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5274) Stabilize UDFRegistration API

2015-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5274.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Stabilize UDFRegistration API
 -

 Key: SPARK-5274
 URL: https://issues.apache.org/jira/browse/SPARK-5274
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.3.0


 1. Removed UDFRegistration as a mixin in SQLContext and made it a field 
 (udf). This removes 45 methods from SQLContext.
 2. For Java UDFs, renamed dataType to returnType.
 3. For Scala UDFs, added type tags.
 4. Added all Java UDF registration methods to Scala's UDFRegistration.
 5. Better documentation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5144) spark-yarn module should be published

2015-01-15 Thread Matthew Sanders (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279457#comment-14279457
 ] 

Matthew Sanders commented on SPARK-5144:


+1 -- I am in a similar situation and would love to see this addressed somehow. 

 spark-yarn module should be published
 -

 Key: SPARK-5144
 URL: https://issues.apache.org/jira/browse/SPARK-5144
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Aniket Bhatnagar

 We disabled publishing of certain modules in SPARK-3452. One such module 
 is spark-yarn. This breaks applications that submit Spark jobs 
 programmatically with master set as yarn-client. This is because SparkContext 
 depends on classes from the yarn-client module to submit the YARN 
 application. 
 Here is the stack trace that you get if you submit the spark job without 
 yarn-client dependency:
 2015-01-07 14:39:22,799 [pool-10-thread-13] [info] o.a.s.s.MemoryStore - 
 MemoryStore started with capacity 731.7 MB
 Exception in thread pool-10-thread-13 java.lang.ExceptionInInitializerError
 at org.apache.spark.util.Utils$.getSparkOrYarnConfig(Utils.scala:1784)
 at org.apache.spark.storage.BlockManager.init(BlockManager.scala:105)
 at org.apache.spark.storage.BlockManager.init(BlockManager.scala:180)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:292)
 at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
 at org.apache.spark.SparkContext.init(SparkContext.scala:232)
 at com.myimpl.Server:23)
 at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
 at scala.util.Try$.apply(Try.scala:191)
 at scala.util.Success.map(Try.scala:236)
 at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
 at com.myimpl.FutureTry$$anonfun$1.apply(FutureTry.scala:23)
 at scala.util.Success$$anonfun$map$1.apply(Try.scala:236)
 at scala.util.Try$.apply(Try.scala:191)
 at scala.util.Success.map(Try.scala:236)
 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:235)
 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 Caused by: org.apache.spark.SparkException: Unable to load YARN support
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:199)
 at org.apache.spark.deploy.SparkHadoopUtil$.init(SparkHadoopUtil.scala:194)
 at org.apache.spark.deploy.SparkHadoopUtil$.clinit(SparkHadoopUtil.scala)
 ... 27 more
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:190)
 at 
 org.apache.spark.deploy.SparkHadoopUtil$.liftedTree1$1(SparkHadoopUtil.scala:195)
 ... 29 more
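 What users would need, once the artifact is published again, is simply a normal 
 dependency declaration. An illustrative sbt sketch (the coordinates shown are 
 assumptions, mirroring how the other Spark modules are published):
 {code}
 // build.sbt sketch; spark-yarn coordinates are illustrative, assuming the
 // module is published alongside the other Spark artifacts.
 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % "1.2.0",
   "org.apache.spark" %% "spark-yarn" % "1.2.0"
 )
 {code}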



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279406#comment-14279406
 ] 

Josh Rosen commented on SPARK-4879:
---

I'm not sure that SparkHadoopWriter's use of FileOutputCommitter properly obeys 
the OutputCommitter contracts in Hadoop.  According to the [OutputCommitter 
Javadoc|https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/OutputCommitter.html]

{quote}
The methods in this class can be called from several different processes and 
from several different contexts. It is important to know which process and 
which context each is called from. Each method should be marked accordingly in 
its documentation. It is also important to note that not all methods are 
guaranteed to be called once and only once. If a method is not guaranteed to 
have this property the output committer needs to handle this appropriately. 
Also note it will only be in rare situations where they may be called multiple 
times for the same task.
{quote}

Based on the documentation, `needsTaskCommit` is called from each individual 
task's process that will output to HDFS, and it is called just for that task, 
so it seems like it should be safe to call this from SparkHadoopWriter.

However, maybe we're misusing the `commitTask` method:

{quote}
If needsTaskCommit(TaskAttemptContext) returns true and this task is the task 
that the AM determines finished first, this method is called to commit an 
individual task's output. This is to mark that tasks output as complete, as 
commitJob(JobContext) will also be called later on if the entire job finished 
successfully. This is called from a task's process. This may be called multiple 
times for the same task, but different task attempts. It should be very rare 
for this to be called multiple times and requires odd networking failures to 
make this happen. In the future the Hadoop framework may eliminate this race. 
{quote}

I think that we're missing the "this task is the task that the AM determines 
finished first" part of the equation here.  If `needsTaskCommit` is false, then 
we definitely shouldn't commit (e.g. if it's an original task that lost to a 
speculated copy), but if it's true then I don't think it's safe to commit; we 
need some central authority to pick a winner.

Let's see how Hadoop does things, working backwards from actual calls of 
`commitTask` to see whether they're guarded by some coordination through the 
AM.  It looks like `OutputCommitter` is part of the `mapred` API, so I'll only 
look at classes in that package:

In `Task.java`, `committer.commitTask` is only performed after checking 
`canCommit` through `TaskUmbilicalProtocol`: 
https://github.com/apache/hadoop/blob/a655973e781caf662b360c96e0fa3f5a873cf676/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Task.java#L1185.
  According to the Javadocs for TaskAttemptListenerImpl.canCommit (the actual 
concrete implementation of this method):

{code}
  /**
   * Child checking whether it can commit.
   * 
   * <br/>
   * Commit is a two-phased protocol. First the attempt informs the
   * ApplicationMaster that it is
   * {@link #commitPending(TaskAttemptID, TaskStatus)}. Then it repeatedly polls
   * the ApplicationMaster whether it {@link #canCommit(TaskAttemptID)} This is
   * a legacy from the centralized commit protocol handling by the JobTracker.
   */
  @Override
  public boolean canCommit(TaskAttemptID taskAttemptID) throws IOException {
{code}

This ends up delegating to `Task.canCommit()`:

{code}
  /**
   * Can the output of the taskAttempt be committed. Note that once the task
   * gives a go for a commit, further canCommit requests from any other attempts
   * should return false.
   * 
   * @param taskAttemptID
   * @return whether the attempt's output can be committed or not.
   */
  boolean canCommit(TaskAttemptId taskAttemptID);
{code}

There's a bunch of tricky logic that involves communication with the AM (see 
AttemptCommitPendingTransition and the other transitions in TaskImpl), but it 
looks like the gist is that the winner is picked by the AM through some 
central coordination process. 

So, it looks like the right fix is to implement these same state transitions 
ourselves.  It would be nice if there was a clean way to do this that could be 
easily backported to maintenance branches.  
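To make the shape of that fix concrete, here is a minimal, hypothetical sketch of a driver-side authority that lets exactly one attempt per partition commit. This is not Spark's actual code, just an illustration of the state the AM keeps in Hadoop:

{code}
// Hypothetical driver-side commit authorizer; names and structure are illustrative.
class CommitAuthorizer {
  // partition index -> attempt id that has been authorized to commit
  private val winners = scala.collection.mutable.Map[Int, Long]()

  /** The first attempt to ask for a partition wins; every other attempt is denied. */
  def canCommit(partition: Int, attemptId: Long): Boolean = synchronized {
    winners.get(partition) match {
      case Some(winner) => winner == attemptId
      case None =>
        winners(partition) = attemptId
        true
    }
  }
}

// A task-side writer would then only call committer.commitTask() after
// authorizer.canCommit(partitionId, attemptId) returns true.
{code}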

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
 

[jira] [Created] (SPARK-5275) pyspark.streaming is not included in assembly jar

2015-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5275:
-

 Summary: pyspark.streaming is not included in assembly jar
 Key: SPARK-5275
 URL: https://issues.apache.org/jira/browse/SPARK-5275
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker


The pyspark.streaming module is not included in the assembly jar of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5276) pyspark.streaming is not included in assembly jar

2015-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-5276:
-

 Summary: pyspark.streaming is not included in assembly jar
 Key: SPARK-5276
 URL: https://issues.apache.org/jira/browse/SPARK-5276
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.0, 1.3.0
Reporter: Davies Liu
Priority: Blocker


The pyspark.streaming module is not included in the assembly jar of Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279274#comment-14279274
 ] 

Joseph K. Bradley commented on SPARK-1405:
--

I'll try out the statmt dataset if that will be easier for everyone to access.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs

2015-01-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279602#comment-14279602
 ] 

Imran Rashid commented on SPARK-3622:
-

In some ways this kinda reminds me of the problem w/ accumulators and lazy 
transformations.  Accumulators are basically multiple outputs, but Spark itself 
provides no way to track when that output is ready.  It's up to the developer to 
figure it out.

If you do a transformation on {{rddA}}, you've got to know to wait until 
you've also got a transformation on {{rddB}} ready as well.  Probably the 
simplest case for this is filtering records by some condition, but keeping both 
the good and bad records, a la Scala collections' {{partition}} method.  I think 
this has come up on the user mailing list a few times.

What about having some new type {{MultiRDD}}, which only runs when you've 
queued up an action on *all* RDDs?  eg. something like:

{code}
val input: RDD[String] = ...
val goodAndBad: MultiRdd[String, String] = input.partition{ str => 
MyRecordParser.isOk(str)}
val bad: RDD[String] = goodAndBad.get(1)
bad.saveAsTextFile(...) // doesn't do anything yet
val parsed: RDD[MyCaseClass] = goodAndBad.get(0).map{str => 
MyRecordParser.parse(str)}
val tmp: RDD[MyCaseClass] = parsed.map{f1}.filter{f2}.mapPartitions{f3} // still 
don't do anything ...
val result = tmp.reduce{reduceFunc} // now everything gets run
{code}
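For contrast, the workaround available today makes two passes over the input (sketch only; {{MyRecordParser}} is the same hypothetical helper as above):

{code}
// Two-pass workaround that a MultiRDD-style partition would avoid.
val input: RDD[String] = sc.textFile("...")
val good: RDD[String] = input.filter(str => MyRecordParser.isOk(str))
val bad: RDD[String]  = input.filter(str => !MyRecordParser.isOk(str))
// Caching input avoids re-reading the source, but the data is still scanned twice.
{code}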

 Provide a custom transformation that can output multiple RDDs
 -

 Key: SPARK-3622
 URL: https://issues.apache.org/jira/browse/SPARK-3622
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xuefu Zhang

 All existing transformations return just one RDD at most, even those 
 which take user-supplied functions such as mapPartitions(). However, 
 sometimes a user-provided function may need to output multiple RDDs. For 
 instance, a filter function that divides the input RDD into several RDDs. 
 While it's possible to get multiple RDDs by transforming the same RDD 
 multiple times, it may be more efficient to do this concurrently in one shot, 
 especially if the user's existing function is already generating different data sets.
 This is the case in Hive on Spark, where Hive's map function and reduce function 
 can output different data sets to be consumed by subsequent stages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-01-15 Thread Max Seiden (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Seiden updated SPARK-5277:
--
Remaining Estimate: (was: 24h)
 Original Estimate: (was: 24h)

 SparkSqlSerializer does not register user specified KryoRegistrators 
 -

 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Max Seiden

 Although the SparkSqlSerializer class extends the KryoSerializer in core, 
 its overridden newKryo() does not call super.newKryo(). This results in 
 inconsistent serializer behaviors depending on whether a KryoSerializer 
 instance or a SparkSqlSerializer instance is used. This may also be related 
 to the TODO in KryoResourcePool, which uses KryoSerializer instead of 
 SparkSqlSerializer due to yet-to-be-investigated test failures.
 An example of the divergence in behavior: The Exchange operator creates a new 
 SparkSqlSerializer instance (with an empty conf; another issue) when it is 
 constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
 resource pool (see above). The result is that the serialized in-memory 
 columns are created using the user provided serializers / registrators, while 
 serialization during exchange does not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators

2015-01-15 Thread Max Seiden (JIRA)
Max Seiden created SPARK-5277:
-

 Summary: SparkSqlSerializer does not register user specified 
KryoRegistrators 
 Key: SPARK-5277
 URL: https://issues.apache.org/jira/browse/SPARK-5277
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Max Seiden


Although the SparkSqlSerializer class extends the KryoSerializer in core, its 
overridden newKryo() does not call super.newKryo(). This results in 
inconsistent serializer behaviors depending on whether a KryoSerializer 
instance or a SparkSqlSerializer instance is used. This may also be related to 
the TODO in KryoResourcePool, which uses KryoSerializer instead of 
SparkSqlSerializer due to yet-to-be-investigated test failures.

An example of the divergence in behavior: The Exchange operator creates a new 
SparkSqlSerializer instance (with an empty conf; another issue) when it is 
constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the 
resource pool (see above). The result is that the serialized in-memory columns 
are created using the user provided serializers / registrators, while 
serialization during exchange does not.
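A sketch of the fix implied by the description, assuming the subclass relationship stays as it is today (illustrative only, not the actual patch):

{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoSerializer

class SparkSqlSerializer(conf: SparkConf) extends KryoSerializer(conf) {
  override def newKryo(): Kryo = {
    // Start from the base Kryo, which applies spark.kryo.registrator and the
    // other user-provided settings, then add SQL-specific registrations on top.
    val kryo = super.newKryo()
    // ... register SQL-internal classes here ...
    kryo
  }
}
{code}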



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5011) Add support for WITH SERDEPROPERTIES, TBLPROPERTIES in CREATE TEMPORARY TABLE

2015-01-15 Thread shengli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shengli closed SPARK-5011.
--
Resolution: Later

 Add support for WITH SERDEPROPERTIES, TBLPROPERTIES in CREATE TEMPORARY TABLE
 -

 Key: SPARK-5011
 URL: https://issues.apache.org/jira/browse/SPARK-5011
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.2.1
Reporter: shengli
Priority: Minor
 Fix For: 1.2.1

   Original Estimate: 96h
  Remaining Estimate: 96h

 For external datasource integration.
 We have two kinds of datasource:
 1. File : like avro, json, parquet, etc..
 2. Database: like hbase, cassandra etc...
 For `File`, there is not too much configurations. Using Options Syntax is ok.
 But for Database we usually have many configuration in different levels.
 We need to support `WITH SERDEPROPERTIES` and `TBLPROPERTIES` syntax.
 Like Hive HBase:
 ```
 CREATE TABLE hbase_table_1(key int, value string) 
 STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
 WITH SERDEPROPERTIES (hbase.columns.mapping = :key,cf1:val)
 TBLPROPERTIES (hbase.table.name = xyz);
 ```
 refer links:
 https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4857) Add Executor Events to SparkListener

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4857.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Add Executor Events to SparkListener
 

 Key: SPARK-4857
 URL: https://issues.apache.org/jira/browse/SPARK-4857
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis
 Fix For: 1.3.0


 We need to add events to the SparkListener to indicate an executor has been 
 added or removed with corresponding information. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279711#comment-14279711
 ] 

Apache Spark commented on SPARK-4879:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/4066

 Missing output partitions after job completes with speculative execution
 

 Key: SPARK-4879
 URL: https://issues.apache.org/jira/browse/SPARK-4879
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, Spark Core
Affects Versions: 1.0.2, 1.1.1, 1.2.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Critical
 Attachments: speculation.txt, speculation2.txt


 When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
 save output files may report that they have completed successfully even 
 though some output partitions written by speculative tasks may be missing.
 h3. Reproduction
 This symptom was reported to me by a Spark user and I've been doing my own 
 investigation to try to come up with an in-house reproduction.
 I'm still working on a reliable local reproduction for this issue, which is a 
 little tricky because Spark won't schedule speculated tasks on the same host 
 as the original task, so you need an actual (or containerized) multi-host 
 cluster to test speculation.  Here's a simple reproduction of some of the 
 symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
 spark.speculation=true}}:
 {code}
 // Rig a job such that all but one of the tasks complete instantly
 // and one task runs for 20 seconds on its first attempt and instantly
 // on its second attempt:
 val numTasks = 100
 sc.parallelize(1 to numTasks, 
 numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
   if (ctx.partitionId == 0) {  // If this is the one task that should run 
 really slow
     if (ctx.attemptId == 0) {  // If this is the first attempt, run slow
       Thread.sleep(20 * 1000)
     }
   }
   iter
 }.map(x => (x, x)).saveAsTextFile("/test4")
 {code}
 When I run this, I end up with a job that completes quickly (due to 
 speculation) but reports failures from the speculated task:
 {code}
 [...]
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
 (100/100)
 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
 console:22) finished in 0.856 s
 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
 console:22, took 0.885438374 s
 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
 for 70.1 in stage 3.0 because task 70 has already completed successfully
 scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
 stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
 java.io.IOException: Failed to save output of task: 
 attempt_201412110141_0003_m_49_413
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
 
 org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
 
 org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
 org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
 
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 One interesting thing to note about this stack trace: if we look at 
 {{FileOutputCommitter.java:160}} 
 ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]),
  this point in the execution seems to correspond to a case where a task 
 completes, attempts to commit its output, fails for some reason, then deletes 
 the destination file, tries again, and fails:
 {code}
  if (fs.isFile(taskOutput)) {
 152  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput, 
 153  getTempTaskOutputPath(context));
 154  if (!fs.rename(taskOutput, finalOutputPath)) {
 

[jira] [Commented] (SPARK-4874) Report number of records read/written in a task

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279712#comment-14279712
 ] 

Apache Spark commented on SPARK-4874:
-

User 'ksakellis' has created a pull request for this issue:
https://github.com/apache/spark/pull/4067

 Report number of records read/written in a task
 ---

 Key: SPARK-4874
 URL: https://issues.apache.org/jira/browse/SPARK-4874
 Project: Spark
  Issue Type: Improvement
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 This metric will help us find key skew using the WebUI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5012) Python API for Gaussian Mixture Model

2015-01-15 Thread Meethu Mathew (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279811#comment-14279811
 ] 

Meethu Mathew commented on SPARK-5012:
--

Once SPARK-5019 is resolved, we will make the changes accordingly. Thanks 
[~josephkb] [~tgaloppo] for the comments.

 Python API for Gaussian Mixture Model
 -

 Key: SPARK-5012
 URL: https://issues.apache.org/jira/browse/SPARK-5012
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Meethu Mathew

 Add Python API for the Scala implementation of GMM.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-5278:
--

 Summary: ambiguous reference to fields in Spark SQL is incompleted
 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan


for json string like
{a:[
  {
 b: 1,
  B: 2
   }
}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a:[
  {
b: 1,
B: 2
  }]
}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`
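A self-contained way to reproduce the inconsistency (sketch only; assumes an existing SparkContext {{sc}} and HiveContext {{hc}}):

{code}
val json = """{"a": [{"b": 1, "B": 2}]}"""
val table = hc.jsonRDD(sc.parallelize(Seq(json)))
table.registerTempTable("t")

// Expected: the same ambiguous-reference error as for {"a": {"b": 1, "B": 2}}.
// Observed: the query succeeds and silently returns the first `b`.
hc.sql("SELECT a[0].b FROM t").collect()
{code}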



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-5278:
---
Description: 
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [{b: 1, B: 2}]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`

  was:
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a:[{b: 1, B: 2}]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`


 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 for json string like
 {a: {b: 1, B: 2}}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {a: [{b: 1, B: 2}]}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-5278:
---
Description: 
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a:[{b: 1, B: 2}]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`

  was:
for json string like
{a:[
  {
 b: 1,
  B: 2
   }
}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a:[
  {
b: 1,
B: 2
  }]
}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`


 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 for json string like
 {a: {b: 1, B: 2}}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {a:[{b: 1, B: 2}]}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2630) Input data size of CoalescedRDD is incorrect

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2630.

Resolution: Duplicate

I think this is a dup of SPARK-4092.

 Input data size of CoalescedRDD is incorrect
 

 Key: SPARK-2630
 URL: https://issues.apache.org/jira/browse/SPARK-2630
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Andrew Ash
Priority: Blocker
 Attachments: overflow.tiff


 Given one big file, such as text.4.3G, put it in one task, 
 {code}
 sc.textFile("text.4.3.G").coalesce(1).count()
 {code}
 In the Web UI of Spark, you will see that the input size is 5.4M. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4955:
---
Priority: Blocker  (was: Critical)

 Dynamic allocation doesn't work in YARN cluster mode
 

 Key: SPARK-4955
 URL: https://issues.apache.org/jira/browse/SPARK-4955
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Chengxiang Li
Assignee: Lianhui Wang
Priority: Blocker

 With executor dynamic scaling enabled, in yarn-cluster mode, after a query 
 finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, the 
 executor number is not reduced to the configured min number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4955) Dynamic allocation doesn't work in YARN cluster mode

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4955:
---
Target Version/s: 1.3.0

 Dynamic allocation doesn't work in YARN cluster mode
 

 Key: SPARK-4955
 URL: https://issues.apache.org/jira/browse/SPARK-4955
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Chengxiang Li
Assignee: Lianhui Wang
Priority: Blocker

 With executor dynamic scaling enabled, in yarn-cluster mode, after a query 
 finishes and the spark.dynamicAllocation.executorIdleTimeout interval elapses, the 
 executor number is not reduced to the configured min number.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5216) Spark Ui should report estimated time remaining for each stage.

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279863#comment-14279863
 ] 

Patrick Wendell commented on SPARK-5216:


This has been proposed before, but in the past we decided not to do it. Trying 
to extrapolate the finish time of a stage accurately is basically impossible 
since in many workloads stragglers dominate the total response time. The 
conclusion was that it was better to give no estimate rather than one which is 
likely to be misleading. 

 Spark Ui should report estimated time remaining for each stage.
 ---

 Key: SPARK-5216
 URL: https://issues.apache.org/jira/browse/SPARK-5216
 Project: Spark
  Issue Type: Wish
  Components: Spark Core, Web UI
Affects Versions: 1.3.0
Reporter: Prashant Sharma
Assignee: Prashant Sharma

 Per-stage feedback on estimated remaining time can help the user get a grasp on 
 how much time the job is going to take. This will only require changes on the 
 UI/JobProgressListener side of the code since we already have most of the 
 information needed. 
 In the initial cut, the plan is to estimate time based on statistics of the running 
 job, i.e. the average time taken by each task and the number of tasks per stage. This 
 makes sense when jobs are long. If that works out, more heuristics can be added, 
 like the projected time saved if the rdd is cached, and so on. 
 More precise details will come as this evolves. In the meantime, thoughts on 
 alternate ways and suggestions on usefulness are welcome.
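 As a starting point, the heuristic described above amounts to something like the 
 following (illustrative helper only; the names are not from the codebase):
 {code}
 // remaining ~ pending tasks * average task time so far / currently running tasks
 def estimateRemainingMillis(
     totalTasks: Int,
     finishedTasks: Int,
     runningTasks: Int,
     avgTaskMillis: Double): Double = {
   val pending = totalTasks - finishedTasks
   if (runningTasks <= 0) Double.PositiveInfinity
   else pending * avgTaskMillis / runningTasks
 }
 {code}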



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5279) Use java.math.BigDecimal as the exposed Decimal type

2015-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-5279:
--

 Summary: Use java.math.BigDecimal as the exposed Decimal type
 Key: SPARK-5279
 URL: https://issues.apache.org/jira/browse/SPARK-5279
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


Change it from scala.BigDecimal to java.math.BigDecimal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-5278:
---
Description: 
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [ {b: 1, B: 2} ]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`

  was:
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [{b: 1, B: 2}]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`


 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 for json string like
 {a: {b: 1, B: 2}}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {a: [ {b: 1, B: 2} ]}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5176:
---
Labels: starter  (was: )

 Thrift server fails with confusing error message when deploy-mode is cluster
 

 Key: SPARK-5176
 URL: https://issues.apache.org/jira/browse/SPARK-5176
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Tom Panning
  Labels: starter

 With Spark 1.2.0, when I try to run
 {noformat}
 $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077
 {noformat}
 The log output is
 {noformat}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/java/latest/bin/java -cp 
 ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
 --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
 
 Jar url 'spark-internal' is not in valid format.
 Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
 file:///XX.jar)
 Usage: DriverClient [options] launch active-master jar-url main-class 
 [driver options]
 Usage: DriverClient kill active-master driver-id
 Options:
    -c CORES, --cores CORES    Number of cores to request (default: 1)
    -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
 512)
    -s, --supervise            Whether to restart the driver on failure
    -v, --verbose              Print more debugging output
  
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 {noformat}
 I do not get this error if deploy-mode is set to client. The --deploy-mode 
 option is described by the --help output, so I expected it to work. I 
 checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279869#comment-14279869
 ] 

Patrick Wendell commented on SPARK-5176:


Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143
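
A minimal sketch of what such a check could look like (the field and helper names here are assumptions, patterned on the existing checks rather than copied from them):

{code}
// Hypothetical addition to the deploy-mode validation in SparkSubmit.
val thriftServerClass = "org.apache.spark.sql.hive.thriftserver.HiveThriftServer2"

if (args.deployMode == "cluster" && args.mainClass == thriftServerClass) {
  printErrorAndExit("Cluster deploy mode is currently not supported for the Thrift server.")
}
{code}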

 Thrift server fails with confusing error message when deploy-mode is cluster
 

 Key: SPARK-5176
 URL: https://issues.apache.org/jira/browse/SPARK-5176
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Tom Panning
  Labels: starter

 With Spark 1.2.0, when I try to run
 {noformat}
 $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077
 {noformat}
 The log output is
 {noformat}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/java/latest/bin/java -cp 
 ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
 --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
 
 Jar url 'spark-internal' is not in valid format.
 Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
 file:///XX.jar)
 Usage: DriverClient [options] launch active-master jar-url main-class 
 [driver options]
 Usage: DriverClient kill active-master driver-id
 Options:
    -c CORES, --cores CORES    Number of cores to request (default: 1)
    -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
 512)
    -s, --supervise            Whether to restart the driver on failure
    -v, --verbose              Print more debugging output
  
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 {noformat}
 I do not get this error if deploy-mode is set to client. The --deploy-mode 
 option is described by the --help output, so I expected it to work. I 
 checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-5278:
---
Description: 
at hive context
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [ {b: 1, B: 2} ]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`

  was:
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [ {b: 1, B: 2} ]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`


 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 at hive context
 for json string like
 {a: {b: 1, B: 2}}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {a: [ {b: 1, B: 2} ]}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster

2015-01-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279869#comment-14279869
 ] 

Patrick Wendell edited comment on SPARK-5176 at 1/16/15 6:28 AM:
-

Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143

[~tpanning] are you interested in contributing this? If not, someone else will 
pick it up.


was (Author: pwendell):
Yes, we should add a check here similar to the existing ones for the 
thriftserver class:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L143

 Thrift server fails with confusing error message when deploy-mode is cluster
 

 Key: SPARK-5176
 URL: https://issues.apache.org/jira/browse/SPARK-5176
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.2.0
Reporter: Tom Panning
  Labels: starter

 With Spark 1.2.0, when I try to run
 {noformat}
 $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077
 {noformat}
 The log output is
 {noformat}
 Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 Spark Command: /usr/java/latest/bin/java -cp 
 ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar
  -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit 
 --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 
 --deploy-mode cluster --master 
 spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal
 
 Jar url 'spark-internal' is not in valid format.
 Must be a jar file path in URL format (e.g. hdfs://host:port/XX.jar, 
 file:///XX.jar)
 Usage: DriverClient [options] launch active-master jar-url main-class 
 [driver options]
 Usage: DriverClient kill active-master driver-id
 Options:
    -c CORES, --cores CORES    Number of cores to request (default: 1)
    -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 
 512)
    -s, --supervise            Whether to restart the driver on failure
    -v, --verbose              Print more debugging output
  
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 {noformat}
 I do not get this error if deploy-mode is set to client. The --deploy-mode 
 option is described by the --help output, so I expected it to work. I 
 checked, and this behavior seems to be present in Spark 1.1.0 as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-5278:
---
Description: 
at hive context

for json string like
{code}{a: {b: 1, B: 2}}{code}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{code}{a: [{b: 1, B: 2}]}{code}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`

  was:
at hive context
for json string like
{a: {b: 1, B: 2}}
The SQL `SELECT a.b from t` will report error for ambiguous reference to fields.
But for json string like
{a: [ {b: 1, B: 2} ]}
The SQL `SELECT a[0].b from t` will pass and pick the first `b`


 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 at hive context
 for json string like
 {code}{a: {b: 1, B: 2}}{code}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {code}{a: [{b: 1, B: 2}]}{code}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class

2015-01-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279883#comment-14279883
 ] 

Yin Huai commented on SPARK-5260:
-

[~sonixbp] If you like, you can make the change and create a pull request. I 
can help you on that.

Btw, just a note: we do not add fix version(s) until the change has been merged into 
our code base.

 Expose JsonRDD.allKeysWithValueTypes() in a utility class 
 --

 Key: SPARK-5260
 URL: https://issues.apache.org/jira/browse/SPARK-5260
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Corey J. Nolet
 Fix For: 1.3.0


 I have found this method extremely useful when implementing my own strategy 
 for inferring a schema from parsed json. For now, I've actually copied the 
 method right out of the JsonRDD class into my own project but I think it 
 would be immensely useful to keep the code in Spark and expose it publicly 
 somewhere else, like an object called JsonSchema.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5278) ambiguous reference to fields in Spark SQL is incompleted

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279892#comment-14279892
 ] 

Apache Spark commented on SPARK-5278:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/4068

 ambiguous reference to fields in Spark SQL is incompleted
 -

 Key: SPARK-5278
 URL: https://issues.apache.org/jira/browse/SPARK-5278
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan

 at hive context
 for json string like
 {code}{a: {b: 1, B: 2}}{code}
 The SQL `SELECT a.b from t` will report error for ambiguous reference to 
 fields.
 But for json string like
 {code}{a: [{b: 1, B: 2}]}{code}
 The SQL `SELECT a[0].b from t` will pass and pick the first `b`



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5251) Using `tableIdentifier` in hive metastore

2015-01-15 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-5251:
---
Target Version/s: 1.3.0

 Using `tableIdentifier` in hive metastore 
 --

 Key: SPARK-5251
 URL: https://issues.apache.org/jira/browse/SPARK-5251
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

 Using `tableIdentifier` in hive metastore 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5251) Using `tableIdentifier` in hive metastore

2015-01-15 Thread wangfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangfei updated SPARK-5251:
---
Target Version/s:   (was: 1.3.0)

 Using `tableIdentifier` in hive metastore 
 --

 Key: SPARK-5251
 URL: https://issues.apache.org/jira/browse/SPARK-5251
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: wangfei

 Using `tableIdentifier` in hive metastore 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279914#comment-14279914
 ] 

Reynold Xin commented on SPARK-2686:


Do you mind closing the pull request? I will reopen the ticket.

 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
 val testData: SchemaRDD = sqlc.sparkContext.parallelize(
   (1 to 100).map(i => TestData(i, i.toString)))
 testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc 
 limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql
 hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17:
 SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main 
 classes changed.  The testing suites affected are ConstantFolding and 
 ExpressionEvaluation.
 In addition, some ad-hoc testing was done, as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2015-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-2686:


 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
 val testData: SchemaRDD = sqlc.sparkContext.parallelize(
   (1 to 100).map(i => TestData(i, i.toString)))
 testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc 
 limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql
 hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17:
 SQLParser, Optimizer, Expression, stringOperations, and HiveQL were the main 
 classes changed.  The testing suites affected are ConstantFolding and 
 ExpressionEvaluation.
 In addition, some ad-hoc testing was done, as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279913#comment-14279913
 ] 

Reynold Xin commented on SPARK-2686:


[~javadba] I think Michael meant closing the pull request, but not the ticket 
...


 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
   val testData: SchemaRDD = sqlc.sparkContext.parallelize(
     (1 to 100).map(i => TestData(i, i.toString)))
   testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, 
 Expression, stringOperations, and HiveQL were the main classes changed. The 
 testing suites affected are ConstantFolding and ExpressionEvaluation.
 In addition, some ad-hoc testing was done as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4867) UDF clean up

2015-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-4867:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5166

 UDF clean up
 

 Key: SPARK-4867
 URL: https://issues.apache.org/jira/browse/SPARK-4867
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 Right now our support and internal implementation of many functions has a few 
 issues.  Specifically:
  - UDFs don't know their input types and thus don't do type coercion.
  - We hard code a bunch of built in functions into the parser.  This is bad 
 because in SQL it creates new reserved words for things that aren't actually 
 keywords.  Also it means that for each function we need to add support to 
 both SQLContext and HiveContext separately.
 For this JIRA I propose we do the following:
  - Change the interfaces for registerFunction and ScalaUdf to include types 
 for the input arguments as well as the output type.
  - Add a rule to analysis that does type coercion for UDFs.
  - Add a parse rule for functions to SQLParser.
  - Rewrite all the UDFs that are currently hacked into the various parsers 
 using this new functionality.
 Depending on how big this refactoring becomes, we could split parts 1 and 2 
 from part 3 above.
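 To make the proposed interface change concrete, here is a minimal, self-contained 
 sketch of a typed registerFunction (all names here are hypothetical illustrations, 
 not Spark's actual API): the UDF carries its input and output types so an analysis 
 rule could insert casts instead of hard-coding functions in the parsers.
 {code}
// Hypothetical sketch, not Spark's API: a UDF registered with declared types.
sealed trait SqlType
case object StringSqlType extends SqlType
case object IntSqlType extends SqlType

case class TypedUdf(
    name: String,
    inputTypes: Seq[SqlType],   // declared argument types, enabling coercion
    returnType: SqlType,        // declared result type
    func: Seq[Any] => Any)

object UdfRegistry {
  private val udfs = scala.collection.mutable.Map.empty[String, TypedUdf]
  // registerFunction now records input and output types alongside the closure.
  def registerFunction(udf: TypedUdf): Unit = udfs(udf.name) = udf
  def lookup(name: String): Option[TypedUdf] = udfs.get(name)
}

// Example: a strlen UDF declaring (String) => Int, so mismatched arguments
// could be coerced by an analysis rule rather than rejected at runtime.
UdfRegistry.registerFunction(TypedUdf(
  "strlen", Seq(StringSqlType), IntSqlType,
  args => args.head.asInstanceOf[String].length))
 {code}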



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5211) Restore HiveMetastoreTypes.toDataType

2015-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5211.

   Resolution: Fixed
Fix Version/s: 1.3.0
 Assignee: Yin Huai

 Restore HiveMetastoreTypes.toDataType
 -

 Key: SPARK-5211
 URL: https://issues.apache.org/jira/browse/SPARK-5211
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical
 Fix For: 1.3.0


 It was a public API. Since developers are using it, we need to get it back.
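 For reference, a usage sketch of the restored helper (the exact package and 
 signature are assumptions based on the 1.2-era code, so treat this as illustrative 
 only):
 {code}
import org.apache.spark.sql.hive.HiveMetastoreTypes

// Assumed signature: toDataType(metastoreType: String): DataType, parsing a Hive
// metastore type string into a Catalyst DataType.
val dt = HiveMetastoreTypes.toDataType("array<struct<a:int,b:string>>")
println(dt)  // expected: an ArrayType wrapping a StructType with fields a and b
 {code}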



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2015-01-15 Thread Stephen Boesch (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279920#comment-14279920
 ] 

Stephen Boesch commented on SPARK-2686:
---

ok closed




 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
   val testData: SchemaRDD = sqlc.sparkContext.parallelize(
     (1 to 100).map(i => TestData(i, i.toString)))
   testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, 
 Expression, stringOperations, and HiveQL were the main classes changed. The 
 testing suites affected are ConstantFolding and ExpressionEvaluation.
 In addition, some ad-hoc testing was done as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2686) Add Length support to Spark SQL and HQL and Strlen support to SQL

2015-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279923#comment-14279923
 ] 

Reynold Xin commented on SPARK-2686:


Thanks. Let's pull it in once SPARK-4867 is fixed.

 Add Length support to Spark SQL and HQL and Strlen support to SQL
 -

 Key: SPARK-2686
 URL: https://issues.apache.org/jira/browse/SPARK-2686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
 Environment: all
Reporter: Stephen Boesch
Priority: Minor
  Labels: hql, length, sql
   Original Estimate: 0h
  Remaining Estimate: 0h

 Syntactic, parsing, and operational support have been added for LEN(GTH) and 
 STRLEN functions.
 Examples:
 SQL:
 import org.apache.spark.sql._
 case class TestData(key: Int, value: String)
 val sqlc = new SQLContext(sc)
 import sqlc._
   val testData: SchemaRDD = sqlc.sparkContext.parallelize(
     (1 to 100).map(i => TestData(i, i.toString)))
   testData.registerAsTable("testData")
 sqlc.sql("select length(key) as key_len from testData order by key_len desc limit 5").collect
 res12: Array[org.apache.spark.sql.Row] = Array([3], [2], [2], [2], [2])
 HQL:
 val hc = new org.apache.spark.sql.hive.HiveContext(sc)
 import hc._
 hc.hql("select length(grp) from simplex").collect
 res14: Array[org.apache.spark.sql.Row] = Array([6], [6], [6], [6])
 As far as codebase changes: they have been purposefully made similar to the 
 ones made for adding SUBSTR(ING) from July 17: SQLParser, Optimizer, 
 Expression, stringOperations, and HiveQL were the main classes changed. The 
 testing suites affected are ConstantFolding and ExpressionEvaluation.
 In addition, some ad-hoc testing was done as shown in the examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-15 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-5262:
--

 Summary: coalesce should allow NullType and 1 another type in 
parameters
 Key: SPARK-5262
 URL: https://issues.apache.org/jira/browse/SPARK-5262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


Currently Coalesce(null, 1, null) would throw exceptions.
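A minimal reproduction sketch (the SQLContext instance and the table name `t` are 
placeholders; any registered table works, since no columns are referenced):
{code}
// Sketch: before the fix, analysis of the expression throws because NullType and
// IntegerType are not reconciled; afterwards each row should simply contain 1.
val rows = sqlContext.sql("SELECT COALESCE(NULL, 1, NULL) FROM t").collect()
println(rows.mkString(", "))
{code}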



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5262) coalesce should allow NullType and 1 another type in parameters

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278416#comment-14278416
 ] 

Apache Spark commented on SPARK-5262:
-

User 'adrian-wang' has created a pull request for this issue:
https://github.com/apache/spark/pull/4057

 coalesce should allow NullType and 1 another type in parameters
 ---

 Key: SPARK-5262
 URL: https://issues.apache.org/jira/browse/SPARK-5262
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 Currently Coalesce(null, 1, null) would throw exceptions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1084) Fix most build warnings

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1084:
--
Reporter: Sean Owen  (was: Sean Owen)

 Fix most build warnings
 ---

 Key: SPARK-1084
 URL: https://issues.apache.org/jira/browse/SPARK-1084
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
  Labels: mvn, sbt, warning
 Fix For: 1.0.0


 I hope another boring tidy-up JIRA might be welcome. I'd like to fix most of 
 the warnings that appear during build, so that developers don't become 
 accustomed to them. The accompanying pull request contains a number of 
 commits to quash most warnings observed through the mvn and sbt builds, 
 although not all of them.
 FIXED!
 [WARNING] Parameter tasks is deprecated, use target instead
 Just a matter of updating tasks -> target in inline Ant scripts.
 WARNING: -p has been deprecated and will be reused for a different (but still 
 very cool) purpose in ScalaTest 2.0. Please change all uses of -p to -R.
 Goes away with updating the scalatest plugin -> 1.0-RC2
 [WARNING] Note: 
 /Users/srowen/Documents/incubator-spark/core/src/test/scala/org/apache/spark/JavaAPISuite.java
  uses unchecked or unsafe operations.
 [WARNING] Note: Recompile with -Xlint:unchecked for details.
 Mostly @SuppressWarnings("unchecked"), but a few more things were needed to reveal 
 the warning source: <fork>true</fork> (also needed for <maxmem>) and version 
 3.1 of the plugin. In a few cases some declaration changes were appropriate 
 to avoid warnings.
 /Users/srowen/Documents/incubator-spark/core/src/main/scala/org/apache/spark/util/IndestructibleActorSystem.scala:25:
  warning: Could not find any member to link for akka.actor.ActorSystem.
 /**
 ^
 Getting several scaladoc errors like this and I'm not clear why it can't find 
 the type -- outside its module? Remove the links as they're evidently not 
 linking anyway?
 /Users/srowen/Documents/incubator-spark/repl/src/main/scala/org/apache/spark/repl/SparkIMain.scala:86:
  warning: Variable eval undefined in comment for class SparkIMain in class 
 SparkIMain
 $ has to be escaped as \$ in scaladoc, apparently
 [WARNING] 
 'dependencyManagement.dependencies.dependency.exclusions.exclusion.artifactId'
  for org.apache.hadoop:hadoop-yarn-client:jar with value '*' does not match a 
 valid id pattern. @ org.apache.spark:spark-parent:1.0.0-incubating-SNAPSHOT, 
 /Users/srowen/Documents/incubator-spark/pom.xml, line 494, column 25
 This one might need review.
 This is valid Maven syntax, but, Maven still warns on it. I wanted to see if 
 we can do without it. 
 These are trying to exclude:
 - org.codehaus.jackson
 - org.sonatype.sisu.inject
 - org.xerial.snappy
 org.sonatype.sisu.inject doesn't actually seem to be a dependency anyway. 
 org.xerial.snappy is used by dependencies but the version seems to match 
 anyway (1.0.5).
 org.codehaus.jackson was intended to exclude 1.8.8, since Spark streaming 
 wants 1.9.11 directly. But the exclusion is in the wrong place if so, since 
 Spark depends straight on Avro, which is what brings in 1.8.8, still. 
 (hadoop-client 1.0.4 includes Jackson 1.0.1, so that needs an exclusion, but 
 the other Hadoop modules don't.)
 HBase depends on 1.8.8 but figured it was intentional to leave that as it 
 would not collide with Spark streaming. (?)
 (I understand this varies by Hadoop version but confirmed this is all the 
 same for 1.0.4, 0.23.7, 2.2.0.)
 NOT FIXED.
 [warn] 
 /Users/srowen/Documents/incubator-spark/streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala:305:
  method connect in class IOManager is deprecated: use the new implementation 
 in package akka.io instead
 [warn]   override def preStart = IOManager(context.system).connect(new 
 InetSocketAddress(port))
 Not confident enough to fix this.
 [WARNING] there were 6 feature warning(s); re-run with -feature for details
 Don't know enough Scala to address these, yet.
 [WARNING] We have a duplicate 
 org/yaml/snakeyaml/scanner/ScannerImpl$Chomping.class in 
 /Users/srowen/.m2/repository/org/yaml/snakeyaml/1.6/snakeyaml-1.6.jar
 Probably addressable by being more careful about how binaries are packed, 
 though this appears to be ignorable; two identical copies of the class are 
 colliding.
 [WARNING] Zinc server is not available at port 3030 - reverting to normal 
 incremental compile
 and
 [WARNING] JAR will be empty - no content was marked for inclusion!
 Apparently harmless warnings, but I don't know how to disable them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Updated] (SPARK-1181) 'mvn test' fails out of the box since sbt assembly does not necessarily exist

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1181:
--
Reporter: Sean Owen  (was: Sean Owen)

 'mvn test' fails out of the box since sbt assembly does not necessarily exist
 -

 Key: SPARK-1181
 URL: https://issues.apache.org/jira/browse/SPARK-1181
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
  Labels: assembly, maven, sbt, test

 The test suite requires that sbt assembly has been run in order for some 
 tests (like DriverSuite) to pass. The tests themselves say as much.
 This means that a mvn test from a fresh clone fails.
 There's a pretty simple fix, to have Maven's test-compile phase invoke sbt 
 assembly. I suppose the only downside is re-invoking sbt assembly each 
 time tests are run.
 I'm open to ideas about how to set this up more intelligently but it would be 
 a generally good thing if the Maven build's tests passed out of the box.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1315) spark on yarn-alpha with mvn on master branch won't build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1315:
--
Assignee: Sean Owen  (was: Sean Owen)

 spark on yarn-alpha with mvn on master branch won't build
 -

 Key: SPARK-1315
 URL: https://issues.apache.org/jira/browse/SPARK-1315
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Thomas Graves
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0


 I tried to build off the master branch using maven to build yarn-alpha but got the 
 following errors.
 mvn  -Dyarn.version=0.23.10 -Dhadoop.version=0.23.10  -Pyarn-alpha  clean 
 package -DskipTests 
 -
 [ERROR] 
 /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:25:
  object runtime is not a member of package reflect
 [ERROR] import 
 scala.reflect.runtime.universe.runtimeMirror
 [ERROR]  ^
 [ERROR] 
 /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:40:
  not found: value runtimeMirror
 [ERROR]   private val mirror = runtimeMirror(classLoader)
 [ERROR]^
 [ERROR] 
 /home/tgraves/y-spark-git/tools/src/main/scala/org/apache/spark/tools/GenerateMIMAIgnore.scala:92:
  object tools is not a member of package scala
 [ERROR] scala.tools.nsc.io.File(.mima-excludes).
 [ERROR]   ^
 [ERROR] three errors found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2879) Use HTTPS to access Maven Central and other repos

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2879:
--
Assignee: Sean Owen  (was: Sean Owen)

 Use HTTPS to access Maven Central and other repos
 -

 Key: SPARK-2879
 URL: https://issues.apache.org/jira/browse/SPARK-2879
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 Maven Central has just now enabled HTTPS access for everyone to Maven Central 
 (http://central.sonatype.org/articles/2014/Aug/03/https-support-launching-now/)
  This is timely, as a reminder of how easily an attacker can slip malicious 
 code into a build that's downloading artifacts over HTTP 
 (http://blog.ontoillogical.com/blog/2014/07/28/how-to-take-over-any-java-developer/).
 In the meantime, it looks like the Spring repo also now supports HTTPS, so 
 can be used this way too.
 I propose to use HTTPS to access these repos.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-3803:
--
Assignee: Sean Owen  (was: Sean Owen)

 ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
 

 Key: SPARK-3803
 URL: https://issues.apache.org/jira/browse/SPARK-3803
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Masaru Dobashi
Assignee: Sean Owen
 Fix For: 1.2.0


 When I executed computePrincipalComponents method of RowMatrix, I got 
 java.lang.ArrayIndexOutOfBoundsException.
 {code}
 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at 
 RDDFunctions.scala:111
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 
 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)
 
 org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 
 scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
 scala.collection.Iterator$class.foreach(Iterator.scala:727)
 scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
 scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
 
 scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
 scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 
 org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 org.apache.spark.scheduler.Task.run(Task.scala:54)
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 {code}
 The RowMatrix instance was generated from the result of TF-IDF like the 
 following.
 {code}
 scala> val hashingTF = new HashingTF()
 scala> val tf = hashingTF.transform(texts)
 scala> import org.apache.spark.mllib.feature.IDF
 scala> tf.cache()
 scala> val idf = new IDF().fit(tf)
 scala> val tfidf: RDD[Vector] = idf.transform(tf)
 scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
 scala> val mat = new RowMatrix(tfidf)
 scala> val pc = mat.computePrincipalComponents(2)
 {code}
 I think this was because I created HashingTF instance with default 
 numFeatures and Array is used in RowMatrix#computeGramianMatrix method
 like the following.
 {code}
   /**
* Computes the Gramian matrix `A^T A`.
*/
   def computeGramianMatrix(): Matrix = {
 val n = numCols().toInt
 val nt: Int = n * (n + 1) / 2
 // Compute the upper triangular part of the gram matrix.
 val GU = rows.treeAggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
  RowMatrix.dspr(1.0, v, U.data)
  U
    }, combOp = (U1, U2) => U1 += U2)
 RowMatrix.triuToFull(n, GU.data)
   }
 {code} 
  When the size of the Vectors generated by TF-IDF is too large, nt takes an 
  undesirable value (and the Array used in treeAggregate gets an undesirable size), 
  since n * (n + 1) / 2 exceeds Int.MaxValue.
 Is this surmise correct?
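  A quick standalone check of the suspected overflow (plain Scala arithmetic, not 
  Spark code), using HashingTF's default of 2^20 features:
  {code}
val n = 1 << 20                         // 1,048,576 columns from the default HashingTF
val ntAsInt = n * (n + 1) / 2           // Int arithmetic silently wraps to a wrong value
val ntAsLong = n.toLong * (n + 1) / 2   // 549,756,338,176 -- far above Int.MaxValue
println(s"Int: $ntAsInt, Long: $ntAsLong, Int.MaxValue: ${Int.MaxValue}")
  {code}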
 And, of 

[jira] [Updated] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2749:
--
Assignee: Sean Owen  (was: Sean Owen)

 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing 
 junit:junit dep
 ---

 Key: SPARK-2749
 URL: https://issues.apache.org/jira/browse/SPARK-2749
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 The Maven-based builds in the build matrix have been failing for a few days:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/
 On inspection, it looks like the Spark SQL Java tests don't compile:
 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
 I confirmed it by repeating the command vs master:
 mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package
 The problem is that this module doesn't depend on JUnit. In fact, none of the 
 modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it 
 in, in most places. However this module doesn't depend on 
 com.novocode:junit-interface
 Adding the junit:junit dependency fixes the compile problem. In fact, the 
 other modules with Java tests should probably depend on it explicitly instead 
 of happening to get it via com.novocode:junit-interface, since that is a bit 
 SBT/Scala-specific (and I am not even sure it's needed).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1556) jets3t dep doesn't update properly with newer Hadoop versions

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1556:
--
Assignee: Sean Owen  (was: Sean Owen)

 jets3t dep doesn't update properly with newer Hadoop versions
 -

 Key: SPARK-1556
 URL: https://issues.apache.org/jira/browse/SPARK-1556
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.8.1, 0.9.0, 1.0.0
Reporter: Nan Zhu
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0


  In Hadoop 2.2.x or newer, jets3t 0.9.0, which defines 
  S3ServiceException/ServiceException, is introduced; however, Spark still 
  relies on jets3t 0.7.x, which has no definition of these classes.
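  A minimal build.sbt-style workaround sketch (the coordinates are assumptions meant 
  to illustrate pinning a matching jets3t; the real fix is aligning the dependency 
  with the Hadoop version in Spark's own build):
  {code}
// Force a jets3t from the 0.9.x line, which provides the
// org.jets3t.service.S3ServiceException expected by Hadoop 2.2+.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.hadoop" % "hadoop-client" % "2.2.0",
  "net.java.dev.jets3t" % "jets3t" % "0.9.0"  // overrides the transitive 0.7.x
)
  {code}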
  What I hit is the following:
  {code}
 14/04/21 19:30:53 INFO deprecation: mapred.job.id is deprecated. Instead, use 
 mapreduce.job.id
 14/04/21 19:30:53 INFO deprecation: mapred.tip.id is deprecated. Instead, use 
 mapreduce.task.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.id is deprecated. Instead, 
 use mapreduce.task.attempt.id
 14/04/21 19:30:53 INFO deprecation: mapred.task.is.map is deprecated. 
 Instead, use mapreduce.task.ismap
 14/04/21 19:30:53 INFO deprecation: mapred.task.partition is deprecated. 
 Instead, use mapreduce.task.partition
 java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:280)
   at 
 org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:270)
   at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2316)
   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
   at 
 org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2350)
   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2332)
   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:369)
   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
   at 
 org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221)
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:140)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:205)
   at org.apache.spark.SparkContext.runJob(SparkContext.scala:891)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:741)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:692)
   at 
 org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:574)
   at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:900)
   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:15)
   at $iwC$$iwC$$iwC.<init>(<console>:20)
   at $iwC$$iwC.<init>(<console>:22)
   at $iwC.<init>(<console>:24)
   at <init>(<console>:26)
   at .<init>(<console>:30)
   at .<clinit>(<console>)
   at .<init>(<console>:7)
   at .<clinit>(<console>)
   at $print(<console>)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:772)
   at 
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1040)
   at 
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:609)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:640)
   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:604)
   at 
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:793)
   at 
 

[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1071:
--
Reporter: Sean Owen  (was: Sean Owen)

 Tidy logging strategy and use of log4j
 --

 Key: SPARK-1071
 URL: https://issues.apache.org/jira/browse/SPARK-1071
 Project: Spark
  Issue Type: Improvement
  Components: Build, Input/Output
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 Prompted by a recent thread on the mailing list, I tried and failed to see if 
 Spark can be made independent of log4j. There are a few cases where control 
 of the underlying logging is pretty useful, and to do that, you have to bind 
 to a specific logger. 
 Instead I propose some tidying that leaves Spark's use of log4j, but gets rid 
 of warnings and should still enable downstream users to switch. The idea is 
 to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J 
 directly when logging, and where Spark needs to output info (REPL and tests), 
 bind from SLF4J to log4j.
 This leaves the same behavior in Spark. It means that downstream users who 
 want to use something except log4j should:
 - Exclude dependencies on log4j, slf4j-log4j12 from Spark
 - Include dependency on log4j-over-slf4j
 - Include dependency on another logger X, and another slf4j-X
 - Recreate any log config that Spark does, that is needed, in the other 
 logger's config
 That sounds about right.
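 As a concrete illustration of those four steps, here is a minimal build.sbt-style 
 sketch (versions are placeholders; Logback is just one possible replacement logger):
 {code}
// Route Spark's logging to Logback instead of log4j.
libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-core" % "1.0.0")
    .exclude("org.slf4j", "slf4j-log4j12")        // drop the SLF4J-to-log4j binding
    .exclude("log4j", "log4j"),                   // drop log4j itself
  "org.slf4j" % "log4j-over-slf4j" % "1.7.5",     // redirect log4j calls into SLF4J
  "ch.qos.logback" % "logback-classic" % "1.1.2"  // the replacement SLF4J binding
)
 {code}
 Any log configuration Spark previously got from log4j.properties would then need an 
 equivalent logback.xml.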
 Here are the key changes: 
 - Include the jcl-over-slf4j shim everywhere by depending on it in core.
 - Exclude dependencies on commons-logging from third-party libraries.
 - Include the jul-to-slf4j shim everywhere by depending on it in core.
 - Exclude slf4j-* dependencies from third-party libraries to prevent 
 collision or warnings
 - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
 And minor/incidental changes:
 - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a 
 recommended update over 1.7.2
 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
 - (Remove a duplicate mockito dependency declaration that was causing 
 warnings and bugging me)
 Pull request coming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1254:
--
Reporter: Sean Owen  (was: Sean Owen)

 Consolidate, order, and harmonize repository declarations in Maven/SBT builds
 -

 Key: SPARK-1254
 URL: https://issues.apache.org/jira/browse/SPARK-1254
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 This suggestion addresses a few minor suboptimalities with how repositories 
 are handled.
 1) Use HTTPS consistently to access repos, instead of HTTP
 2) Consolidate repository declarations in the parent POM file, in the case of 
 the Maven build, so that their ordering can be controlled to put the fully 
 optional Cloudera repo at the end, after required repos. (This was prompted 
 by the untimely failure of the Cloudera repo this week, which made the Spark 
 build fail. #2 would have prevented that.)
 3) Update SBT build to match Maven build in this regard
 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in 
 Maven, and a build generally would not refer to external snapshots, but I'm 
 not 100% sure on this one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1335) Also increase perm gen / code cache for scalatest when invoked via Maven build

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1335:
--
Reporter: Sean Owen  (was: Sean Owen)

 Also increase perm gen / code cache for scalatest when invoked via Maven build
 --

 Key: SPARK-1335
 URL: https://issues.apache.org/jira/browse/SPARK-1335
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
 Fix For: 1.0.0


 I am observing build failures when the Maven build reaches tests in the new 
 SQL components. (I'm on Java 7 / OSX 10.9). The failure is the usual 
 complaint from scala, that it's out of permgen space, or that JIT out of code 
 cache space.
 I see that various build scripts increase these both for SBT. This change 
 simply adds these settings to scalatest's arguments. Works for me and seems a 
 bit more consistent.
 (In the PR I'm going to tack on some other little changes too -- see PR.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1316) Remove use of Commons IO

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1316:
--
Reporter: Sean Owen  (was: Sean Owen)

 Remove use of Commons IO
 

 Key: SPARK-1316
 URL: https://issues.apache.org/jira/browse/SPARK-1316
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (This follows from a side point on SPARK-1133, in discussion of the PR: 
 https://github.com/apache/spark/pull/164 )
 Commons IO is barely used in the project, and can easily be replaced with 
 equivalent calls to Guava or the existing Spark Utils.scala class.
 Removing a dependency feels good, and this one in particular can get a little 
 problematic since Hadoop uses it too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2341:
--
Assignee: Sean Owen  (was: Sean Owen)

 loadLibSVMFile doesn't handle regression datasets
 -

 Key: SPARK-2341
 URL: https://issues.apache.org/jira/browse/SPARK-2341
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Eustache
Assignee: Sean Owen
Priority: Minor
  Labels: easyfix
 Fix For: 1.1.0


 Many datasets exist in LibSVM format for regression tasks [1] but currently 
 the loadLibSVMFile primitive doesn't handle regression datasets.
 More precisely, the LabelParser is either a MulticlassLabelParser or a 
 BinaryLabelParser. What happens then is that the file is loaded but in 
  multiclass mode: each target value is interpreted as a class name!
 The fix would be to write a RegressionLabelParser which converts target 
 values to Double and plug it into the loadLibSVMFile routine.
 [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html 
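 A sketch of the proposed parser (the LabelParser trait shape shown here is an 
 assumption mirroring what the description implies about the existing parsers):
 {code}
// Sketch only; the real MLlib interface may differ.
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Regression labels are real-valued targets: keep them as-is instead of
// mapping each distinct value to a class index.
object RegressionLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}

// Intended use: pass it to loadLibSVMFile in place of the binary/multiclass parsers.
println(RegressionLabelParser.parse("3.14"))  // 3.14
 {code}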



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1071) Tidy logging strategy and use of log4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1071:
--
Assignee: Sean Owen  (was: Sean Owen)

 Tidy logging strategy and use of log4j
 --

 Key: SPARK-1071
 URL: https://issues.apache.org/jira/browse/SPARK-1071
 Project: Spark
  Issue Type: Improvement
  Components: Build, Input/Output
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 Prompted by a recent thread on the mailing list, I tried and failed to see if 
 Spark can be made independent of log4j. There are a few cases where control 
 of the underlying logging is pretty useful, and to do that, you have to bind 
 to a specific logger. 
 Instead I propose some tidying that leaves Spark's use of log4j, but gets rid 
 of warnings and should still enable downstream users to switch. The idea is 
 to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J 
 directly when logging, and where Spark needs to output info (REPL and tests), 
 bind from SLF4J to log4j.
 This leaves the same behavior in Spark. It means that downstream users who 
 want to use something except log4j should:
 - Exclude dependencies on log4j, slf4j-log4j12 from Spark
 - Include dependency on log4j-over-slf4j
 - Include dependency on another logger X, and another slf4j-X
 - Recreate any log config that Spark does, that is needed, in the other 
 logger's config
 That sounds about right.
 Here are the key changes: 
 - Include the jcl-over-slf4j shim everywhere by depending on it in core.
 - Exclude dependencies on commons-logging from third-party libraries.
 - Include the jul-to-slf4j shim everywhere by depending on it in core.
 - Exclude slf4j-* dependencies from third-party libraries to prevent 
 collision or warnings
 - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests
 And minor/incidental changes:
 - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a 
 recommended update over 1.7.2
 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala)
 - (Remove a duplicate mockito dependency declaration that was causing 
 warnings and bugging me)
 Pull request coming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2798) Correct several small errors in Flume module pom.xml files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2798:
--
Assignee: Sean Owen  (was: Sean Owen)

 Correct several small errors in Flume module pom.xml files
 --

 Key: SPARK-2798
 URL: https://issues.apache.org/jira/browse/SPARK-2798
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 (EDIT) Since the scalatest issue was since resolved, this is now about a few 
 small problems in the Flume Sink pom.xml 
 - scalatest is not declared as a test-scope dependency
 - Its Avro version doesn't match the rest of the build
 - Its Flume version is not synced with the other Flume module
 - The other Flume module declares its dependency on Flume Sink slightly 
 incorrectly, hard-coding the Scala 2.10 version
 - It depends on Scala Lang directly, which it shouldn't



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5263) `create table` DDL need to check if table exists first

2015-01-15 Thread shengli (JIRA)
shengli created SPARK-5263:
--

 Summary: `create table` DDL  need to check if table exists first
 Key: SPARK-5263
 URL: https://issues.apache.org/jira/browse/SPARK-5263
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0


`create table` DDL needs to check if the table exists first



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Grigor updated SPARK-5246:
---
Description: 
##How to reproduce: 
1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to set up a VPC for this bug. After you have followed that 
guide, start a new instance in the VPC and ssh to it (through the NAT server).

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
10.1.1.62:  at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
10.1.1.62:  ... 12 more
10.1.1.62: full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
[timing] spark-standalone setup:  00h 00m 28s
 
(omitted for brevity)
{code}

/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
{code}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
:::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 
8080


15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
HUP, INT]
Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: 
ip-10-1-1-151: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
at 
org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
at 
org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
at 
org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27)
at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
{code}

The problem is that an instance launched in a VPC may not be able to resolve its own 
local hostname. Please see  https://forums.aws.amazon.com/thread.jspa?threadID=92092.

I am going to submit a fix for this problem since I need this functionality 
asap.


## How to reproduce

  was:
How to reproduce: 
1) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at 

[jira] [Updated] (SPARK-5246) spark/spark-ec2.py cannot start Spark master in VPC if local DNS name does not resolve

2015-01-15 Thread Vladimir Grigor (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vladimir Grigor updated SPARK-5246:
---
Description: 
How to reproduce: 

1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to set up a VPC for this bug. After you have followed that 
guide, start a new instance in the VPC and ssh to it (through the NAT server).

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting org.apache.spark.deploy.master.Master, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
failed to launch org.apache.spark.deploy.master.Master:
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
10.1.1.62: starting org.apache.spark.deploy.worker.Worker, logging to 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
10.1.1.62: failed to launch org.apache.spark.deploy.worker.Worker:
10.1.1.62:  at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
10.1.1.62:  ... 12 more
10.1.1.62: full log in 
/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.worker.Worker-1-ip-10-1-1-62.out
[timing] spark-standalone setup:  00h 00m 28s
 
(omitted for brevity)
{code}

/root/spark/sbin/../logs/spark-root-org.apache.spark.deploy.master.Master-1-.out
{code}
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp 
:::/root/ephemeral-hdfs/conf:/root/spark/sbin/../conf:/root/spark/lib/spark-assembly-1.2.0-hadoop1.0.4.jar:/root/spark/lib/datanucleus-api-jdo-3.2.6.jar:/root/spark/lib/datanucleus-rdbms-3.2.9.jar:/root/spark/lib/datanucleus-core-3.2.10.jar
 -XX:MaxPermSize=128m -Dspark.akka.logLifecycleEvents=true -Xms512m -Xmx512m 
org.apache.spark.deploy.master.Master --ip 10.1.1.151 --port 7077 --webui-port 
8080


15/01/14 07:34:47 INFO master.Master: Registered signal handlers for [TERM, 
HUP, INT]
Exception in thread main java.net.UnknownHostException: ip-10-1-1-151: 
ip-10-1-1-151: Name or service not known
at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
at org.apache.spark.util.Utils$.findLocalIpAddress(Utils.scala:620)
at 
org.apache.spark.util.Utils$.localIpAddress$lzycompute(Utils.scala:612)
at org.apache.spark.util.Utils$.localIpAddress(Utils.scala:612)
at 
org.apache.spark.util.Utils$.localIpAddressHostname$lzycompute(Utils.scala:613)
at org.apache.spark.util.Utils$.localIpAddressHostname(Utils.scala:613)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at 
org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:665)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.util.Utils$.localHostName(Utils.scala:665)
at 
org.apache.spark.deploy.master.MasterArguments.init(MasterArguments.scala:27)
at org.apache.spark.deploy.master.Master$.main(Master.scala:819)
at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.net.UnknownHostException: ip-10-1-1-151: Name or service not 
known
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
at 
java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
... 12 more
{code}

The problem is that an instance launched in a VPC may not be able to resolve its own 
local hostname. Please see  https://forums.aws.amazon.com/thread.jspa?threadID=92092.

I am going to submit a fix for this problem since I need this functionality 
asap.


  was:
##How to reproduce: 
1)  http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_Scenario2.html 
should be sufficient to setup VPC for this bug. After you followed that guide, 
start new instance in VPC, ssh to it (though NAT server)

2) user starts a cluster in VPC:
{code}
./spark-ec2 -k key20141114 -i ~/aws/key.pem -s 1 --region=eu-west-1 
--spark-version=1.2.0 --instance-type=m1.large --vpc-id=vpc-2e71dd46 
--subnet-id=subnet-2571dd4d --zone=eu-west-1a  launch SparkByScript
Setting up security groups...

(omitted for brevity)
10.1.1.62
10.1.1.62: no org.apache.spark.deploy.worker.Worker to stop
no org.apache.spark.deploy.master.Master to stop
starting 

[jira] [Created] (SPARK-5264) support `drop table` DDL command

2015-01-15 Thread shengli (JIRA)
shengli created SPARK-5264:
--

 Summary: support `drop table` DDL command 
 Key: SPARK-5264
 URL: https://issues.apache.org/jira/browse/SPARK-5264
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0


support `drop table` DDL command 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5263) `create table` DDL need to check if table exists first

2015-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278476#comment-14278476
 ] 

Apache Spark commented on SPARK-5263:
-

User 'OopsOutOfMemory' has created a pull request for this issue:
https://github.com/apache/spark/pull/4058

 `create table` DDL  need to check if table exists first
 ---

 Key: SPARK-5263
 URL: https://issues.apache.org/jira/browse/SPARK-5263
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: shengli
Priority: Minor
 Fix For: 1.3.0

   Original Estimate: 72h
  Remaining Estimate: 72h

  `create table` DDL needs to check if the table exists first



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5243) Spark will hang if (driver memory + executor memory) exceeds limit on a 1-worker cluster

2015-01-15 Thread Takumi Yoshida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278502#comment-14278502
 ] 

Takumi Yoshida commented on SPARK-5243:
---

Hi!

I found that Spark hangs in the following situation; I guess there may be some 
other condition involved.

 1. the cluster has only one worker.
yes, running standalone.
 
 2. driver memory + executor memory > worker memory
I used the following settings, but it hangs.

driver memory = 1g
executor memory = 1g
worker memory = 3g

 3. deploy-mode = cluster
no, deploy-mode was client (the default).

I used the following code.
 https://gist.github.com/yoshi0309/33bd912d91c0bb5cdf30

command.
 ./bin/spark-submit ./ldgourmetALS.py s3n://abc-takumiyoshida/datasets/ 
--driver-memory 1g

machine.
 Amazon EC2 / m3.medium (3ECU and 3.75GB RAM)






 Spark will hang if (driver memory + executor memory) exceeds limit on a 
 1-worker cluster
 

 Key: SPARK-5243
 URL: https://issues.apache.org/jira/browse/SPARK-5243
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.2.0
 Environment: centos, others should be similar
Reporter: yuhao yang
Priority: Minor

 Spark will hang if calling spark-submit under the conditions:
 1. the cluster has only one worker.
  2. driver memory + executor memory > worker memory
 3. deploy-mode = cluster
 This usually happens during development for beginners.
 There should be some exit mechanism or at least a warning message in the 
 output of the spark-submit.
  I am preparing a PR for this case, and I would like to know your opinions on 
  whether a fix is needed and on better fix options.
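  A minimal sketch of the kind of check the description asks for (all names and the 
  memory accounting are hypothetical, not Spark's actual submission code):
  {code}
// Hypothetical pre-submit sanity check: warn when the requested driver + executor
// memory cannot fit on the largest worker (cluster deploy mode runs both on workers).
case class WorkerInfo(host: String, memoryMb: Int)

def warnIfClusterTooSmall(driverMemMb: Int, executorMemMb: Int, workers: Seq[WorkerInfo]): Unit = {
  val largest = workers.map(_.memoryMb).max
  val needed = driverMemMb + executorMemMb
  if (needed > largest) {
    System.err.println(
      s"WARNING: requested ${needed} MB (driver + executor) exceeds the largest " +
      s"worker's ${largest} MB; the application may wait for resources forever.")
  }
}

// Example: 2 GB driver + 2 GB executor on a single 3 GB worker triggers the warning.
warnIfClusterTooSmall(2048, 2048, Seq(WorkerInfo("worker-1", 3072)))
  {code}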



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1727) Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1727:
--
Assignee: Sean Owen  (was: Sean Owen)

 Correct small compile errors, typos, and markdown issues in (primarly) MLlib 
 docs
 -

 Key: SPARK-1727
 URL: https://issues.apache.org/jira/browse/SPARK-1727
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 While play-testing the Scala and Java code examples in the MLlib docs, I 
 noticed a number of small compile errors, and some typos. This led to finding 
 and fixing a few similar items in other docs. 
 Then in the course of building the site docs to check the result, I found a 
 few small suggestions for the build instructions. I also found a few more 
 formatting and markdown issues uncovered when I accidentally used maruku 
 instead of kramdown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1789) Multiple versions of Netty dependencies cause FlumeStreamSuite failure

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1789:
--
Assignee: Sean Owen  (was: Sean Owen)

 Multiple versions of Netty dependencies cause FlumeStreamSuite failure
 --

 Key: SPARK-1789
 URL: https://issues.apache.org/jira/browse/SPARK-1789
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 0.9.1
Reporter: Sean Owen
Assignee: Sean Owen
  Labels: flume, netty, test
 Fix For: 1.0.0


 TL;DR: there is a bit of JAR hell trouble with Netty that can be mostly 
 resolved, and resolving it fixes a test failure.
 I hit the error described at 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html
  while running FlumeStreamSuite, and have been hitting it for a short while (is 
 it just me?)
 velvia notes:
 I have found a workaround. If you add akka 2.2.4 to your dependencies, then 
 everything works, probably because akka 2.2.4 brings in a newer version of 
 Netty. 
 There are at least 3 versions of Netty in play in the build:
 - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and 
 that is the immediate problem
 - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
 - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
 The POMs try to exclude other versions of netty, but are excluding 
 org.jboss.netty:netty, when in fact older versions of io.netty:netty (not 
 netty-all) are also an issue.
 The org.jboss.netty:netty excludes are largely unnecessary. I replaced many 
 of them with io.netty:netty exclusions until everything agreed on 
 io.netty:netty-all:4.0.17.Final.
 But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. 
 Down-grading to 3.6.6.Final across the board made some Spark code not compile.
 If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to 
 work. Part of the reason seems to be that Netty 3.x used the old 
 `org.jboss.netty` packages. This is less than ideal, but is no worse than the 
 current situation. 
 So this PR resolves the issue and improves the JAR hell, even if it leaves 
 the existing theoretical Netty 3-vs-4 conflict:
 - Remove org.jboss.netty excludes where possible, for clarity; they're not 
 needed except with Hadoop artifacts
 - Add io.netty:netty excludes where needed -- except, let akka keep its 
 io.netty:netty
 - Change a bit of test code that actually depended on Netty 3.x, to use 4.x 
 equivalent
 - Update SBT build accordingly
 A better change would be to update Akka far enough such that it agrees on 
 Netty 4.x, but I don't know if that's feasible.
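 For illustration only, here is a hedged SBT (Scala) fragment showing the shape 
 of the exclusion change described above; the artifact chosen (flume-ng-sdk) is 
 just an example of a Flume dependency, not necessarily the exact one Spark's 
 build declares:
 {code}
 // Sketch of the approach, not the actual SparkBuild.scala change:
 // exclude the old io.netty:netty that Flume 1.4.0 drags in, keep akka's own
 // Netty 3.x untouched, and standardize everything else on netty-all 4.x.
 libraryDependencies ++= Seq(
   ("org.apache.flume" % "flume-ng-sdk" % "1.4.0")
     .exclude("io.netty", "netty"),            // drops io.netty:netty:3.4.0.Final
   "io.netty" % "netty-all" % "4.0.17.Final"   // the version Spark core already uses
 )
 {code}
 The same pattern applies on the Maven side with an io.netty:netty exclusion on 
 the corresponding Flume dependency.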



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1802) Audit dependency graph when Spark is built with -Phive

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1802:
--
Assignee: Sean Owen  (was: Sean Owen)

 Audit dependency graph when Spark is built with -Phive
 --

 Key: SPARK-1802
 URL: https://issues.apache.org/jira/browse/SPARK-1802
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sean Owen
Priority: Blocker
 Fix For: 1.0.0

 Attachments: hive-exec-jar-problems.txt


 I'd like the binary release for 1.0 to include Hive support. Since this 
 isn't enabled by default in the build, I don't think it's as well tested, so 
 we should dig around a bit and decide whether we need to, e.g., add any excludes.
 {code}
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -pl 
 assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' 
 | sort > without_hive.txt
 $ mvn install -Phive -DskipTests && mvn dependency:build-classpath -Phive -pl 
 assembly | grep -v INFO | tr ":" "\n" |  awk ' { FS="/"; print ( $(NF) ); }' 
 | sort > with_hive.txt
 $ diff without_hive.txt with_hive.txt
  antlr-2.7.7.jar
  antlr-3.4.jar
  antlr-runtime-3.4.jar
 10,14d6
  avro-1.7.4.jar
  avro-ipc-1.7.4.jar
  avro-ipc-1.7.4-tests.jar
  avro-mapred-1.7.4.jar
  bonecp-0.7.1.RELEASE.jar
 22d13
  commons-cli-1.2.jar
 25d15
  commons-compress-1.4.1.jar
 33,34d22
  commons-logging-1.1.1.jar
  commons-logging-api-1.0.4.jar
 38d25
  commons-pool-1.5.4.jar
 46,49d32
  datanucleus-api-jdo-3.2.1.jar
  datanucleus-core-3.2.2.jar
  datanucleus-rdbms-3.2.1.jar
  derby-10.4.2.0.jar
 53,57d35
  hive-common-0.12.0.jar
  hive-exec-0.12.0.jar
  hive-metastore-0.12.0.jar
  hive-serde-0.12.0.jar
  hive-shims-0.12.0.jar
 60,61d37
  httpclient-4.1.3.jar
  httpcore-4.1.3.jar
 68d43
  JavaEWAH-0.3.2.jar
 73d47
  javolution-5.5.1.jar
 76d49
  jdo-api-3.0.1.jar
 78d50
  jetty-6.1.26.jar
 87d58
  jetty-util-6.1.26.jar
 93d63
  json-20090211.jar
 98d67
  jta-1.1.jar
 103,104d71
  libfb303-0.9.0.jar
  libthrift-0.9.0.jar
 112d78
  mockito-all-1.8.5.jar
 136d101
  servlet-api-2.5-20081211.jar
 139d103
  snappy-0.2.jar
 144d107
  spark-hive_2.10-1.0.0.jar
 151d113
  ST4-4.0.4.jar
 153d114
  stringtemplate-3.2.1.jar
 156d116
  velocity-1.7.jar
 158d117
  xz-1.0.jar
 {code}
 Some initial investigation suggests we may need to take some precautions 
 around (a) jetty and (b) servlet-api.
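 As a purely illustrative sketch of what such a precaution could look like (SBT/
 Scala form; which Hive artifact actually pulls in the old Jetty and servlet-api 
 jars, and their exact group ids, would need to be confirmed with mvn 
 dependency:tree):
 {code}
 // Hypothetical exclusion sketch, not the actual build change:
 libraryDependencies += ("org.apache.hive" % "hive-exec" % "0.12.0")
   .exclude("org.mortbay.jetty", "jetty")       // jetty-6.1.26.jar
   .exclude("org.mortbay.jetty", "jetty-util")  // jetty-util-6.1.26.jar
   .exclude("org.mortbay.jetty", "servlet-api") // servlet-api-2.5-20081211.jar (group id assumed)
 {code}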



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1248) Spark build error with Apache Hadoop(Cloudera CDH4)

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1248:
--
Assignee: Sean Owen  (was: Sean Owen)

 Spark build error with Apache Hadoop(Cloudera CDH4)
 ---

 Key: SPARK-1248
 URL: https://issues.apache.org/jira/browse/SPARK-1248
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Guoqiang Li
Assignee: Sean Owen
 Fix For: 1.0.0


 {code}
 SPARK_HADOOP_VERSION=2.0.0-cdh4.5.0 SPARK_YARN=true sbt/sbt assembly -d > 
 error.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1120) Send all dependency logging through slf4j

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1120:
--
Assignee: Sean Owen  (was: Sean Owen)

 Send all dependency logging through slf4j
 -

 Key: SPARK-1120
 URL: https://issues.apache.org/jira/browse/SPARK-1120
 Project: Spark
  Issue Type: Improvement
Reporter: Patrick Cogan
Assignee: Sean Owen
 Fix For: 1.0.0


 There are a few dependencies that pull in other logging frameworks which 
 don't get routed correctly. We should include the relevant slf4j adapters and 
 exclude those logging libraries.
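 A minimal SBT (Scala) sketch of the pattern, assuming slf4j 1.7.5 and 
 commons-logging as one of the offending frameworks (the concrete list of 
 adapters and excludes would come from auditing the dependency tree):
 {code}
 // Route commons-logging and java.util.logging calls through slf4j,
 // and keep the original commons-logging jar off the classpath.
 libraryDependencies ++= Seq(
   "org.slf4j" % "jcl-over-slf4j" % "1.7.5",  // adapter for commons-logging callers
   "org.slf4j" % "jul-to-slf4j"   % "1.7.5"   // bridge for java.util.logging callers
 )
 excludeDependencies ++= Seq(
   ExclusionRule("commons-logging", "commons-logging")
 )
 {code}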



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2363) Clean MLlib's sample data files

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-2363:
--
Assignee: Sean Owen  (was: Sean Owen)

 Clean MLlib's sample data files
 ---

 Key: SPARK-2363
 URL: https://issues.apache.org/jira/browse/SPARK-2363
 Project: Spark
  Issue Type: Task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.1.0


 MLlib has sample data under several folders:
 1) data/mllib
 2) data/
 3) mllib/data/*
 Per previous discussion with [~matei], we want to put them under `data/mllib` 
 and clean outdated files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1254) Consolidate, order, and harmonize repository declarations in Maven/SBT builds

2015-01-15 Thread Tony Stevenson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tony Stevenson updated SPARK-1254:
--
Assignee: Sean Owen  (was: Sean Owen)

 Consolidate, order, and harmonize repository declarations in Maven/SBT builds
 -

 Key: SPARK-1254
 URL: https://issues.apache.org/jira/browse/SPARK-1254
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 0.9.0
Reporter: Sean Owen
Assignee: Sean Owen
Priority: Minor
 Fix For: 1.0.0


 This suggestion addresses a few minor suboptimalities with how repositories 
 are handled.
 1) Use HTTPS consistently to access repos, instead of HTTP
 2) Consolidate repository declarations in the parent POM file, in the case of 
 the Maven build, so that their ordering can be controlled to put the fully 
 optional Cloudera repo at the end, after required repos. (This was prompted 
 by the untimely failure of the Cloudera repo this week, which made the Spark 
 build fail. #2 would have prevented that.)
 3) Update SBT build to match Maven build in this regard
 4) Update SBT build to *not* refer to Sonatype snapshot repos. This wasn't in 
 Maven, and a build generally would not refer to external snapshots, but I'm 
 not 100% sure on this one.
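 To make points 1) through 3) concrete, here is an illustrative SBT (Scala) 
 fragment (the repository names and URLs are examples, not the exact set Spark 
 declares): all resolvers use HTTPS, and the optional Cloudera repository is 
 listed last so an outage there cannot block artifacts served by the required 
 repositories.
 {code}
 resolvers ++= Seq(
   "Maven Central"   at "https://repo1.maven.org/maven2",
   "Apache Releases" at "https://repository.apache.org/content/repositories/releases",
   // optional repo last: consulted only when earlier resolvers miss an artifact
   "Cloudera"        at "https://repository.cloudera.com/artifactory/cloudera-repos"
 )
 {code}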



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


