[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201765#comment-14201765 ] zzc commented on SPARK-2468: @Lianhui Wang, how can I view the logs showing that YARN killed the executor's container because its physical memory exceeded the allocated memory? I can't find them.

Netty-based block server / client module
Key: SPARK-2468
URL: https://issues.apache.org/jira/browse/SPARK-2468
Project: Spark
Issue Type: Improvement
Components: Shuffle, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical
Fix For: 1.2.0

Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user-space buffer, and then back into a kernel send buffer before it reaches the NIC. It makes multiple copies of the data and context-switches between kernel and user space. It also creates unnecessary buffers in the JVM that increase GC pressure. Instead, we should use FileChannel.transferTo, which handles this in kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty-based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default.
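For readers unfamiliar with the zero-copy path referenced in the description, here is a minimal, hedged sketch of FileChannel.transferTo in Scala; the file names are hypothetical, and a real server would transfer into a socket channel rather than another file:
{code}
import java.io.{FileInputStream, FileOutputStream}

// Hypothetical shuffle file and destination channel, for illustration only.
val src  = new FileInputStream("shuffle_0_0_0.data").getChannel
val dest = new FileOutputStream("received.data").getChannel

// transferTo lets the kernel move bytes directly from the page cache to the
// destination channel, avoiding the user-space copy described in the issue.
var position = 0L
val size = src.size()
while (position < size) {
  position += src.transferTo(position, size - position, dest)
}
src.close()
dest.close()
{code}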
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201766#comment-14201766 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Yes, please do. What's the memory of your YARN executors/containers? With preferDirectBufs off, we should allocate little to no off-heap memory, so these results are surprising.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201772#comment-14201772 ] zzc commented on SPARK-2468: aa...@databricks.com?
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201774#comment-14201774 ] Aaron Davidson commented on SPARK-2468: --- Yup, that would work.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201778#comment-14201778 ] Lianhui Wang commented on SPARK-2468: - [~zzcclp] In the AM's log you can find a message like: "Exit status: 143. Diagnostics: Container [container-id] is running beyond physical memory limits. Current usage: 8.3 GB of 8 GB physical memory used; 11.0 GB of 16.8 GB virtual memory used. Killing container." I had already set spark.yarn.executor.memoryOverhead=1024 and the executor's memory to 7G, so from the log above I can confirm that the executor uses a large amount of non-heap JVM memory.
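For reference, a hedged sketch of how the numbers in that comment compose on the configuration side (the app name is hypothetical; the config keys are the ones named above, as they existed for Spark 1.x on YARN):
{code}
import org.apache.spark.SparkConf

// Hypothetical configuration mirroring the values described above.
val conf = new SparkConf()
  .setAppName("shuffle-test")                         // hypothetical name
  .set("spark.executor.memory", "7g")                 // executor heap
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra MB requested for off-heap use

// Requested YARN container size ~= 7 GB heap + 1 GB overhead = 8 GB, which matches
// the "8.3 GB of 8 GB physical memory used" limit in the log quoted above.
{code}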
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201783#comment-14201783 ] Aaron Davidson commented on SPARK-2468: --- Thanks a lot for those diagnostics. Can you confirm that spark.shuffle.io.preferDirectBufs does show up in the UI as being set properly? Does your workload mainly involve a large shuffle? How big is each partition/how many are there? In addition to the netty buffers (which _should_ be disabled by the config), we also memory map shuffle blocks larger than 2MB.
[jira] [Created] (SPARK-4295) [External]Exception throws in SparkSinkSuite although all test cases pass
maji2014 created SPARK-4295: --- Summary: [External]Exception throws in SparkSinkSuite although all test cases pass Key: SPARK-4295 URL: https://issues.apache.org/jira/browse/SPARK-4295 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: maji2014 Priority: Minor After the first test case, all other test cases throw javax.management.InstanceAlreadyExistsException: org.apache.flume.channel:type=null , exception as followings: 14/11/07 00:24:51 ERROR MonitoredCounterGroup: Failed to register monitored counter group for type: CHANNEL, name: null javax.management.InstanceAlreadyExistsException: org.apache.flume.channel:type=null at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324) at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522) at org.apache.flume.instrumentation.MonitoredCounterGroup.register(MonitoredCounterGroup.java:108) at org.apache.flume.instrumentation.MonitoredCounterGroup.start(MonitoredCounterGroup.java:88) at org.apache.flume.channel.MemoryChannel.start(MemoryChannel.java:345) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply$mcV$sp(SparkSinkSuite.scala:63) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61) at org.apache.spark.streaming.flume.sink.SparkSinkSuite$$anonfun$2.apply(SparkSinkSuite.scala:61) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.scalatest.FunSuite.runTest(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at org.scalatest.FunSuite.run(FunSuite.scala:1555) at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:55) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2563) at org.scalatest.tools.Runner$$anonfun$doRunRunRunDaDoRunRun$3.apply(Runner.scala:2557) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:2557) at org.scalatest.tools.Runner$$anonfun$runOptionallyWithPassFailReporter$2.apply(Runner.scala:1044) at
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201812#comment-14201812 ] Aaron Davidson commented on SPARK-2468: --- Looking at the netty code a bit more, it seems that Netty might unconditionally allocate direct buffers for IO, whether or not direct is preferred. Additionally, it allocates more memory based on the number of cores in your system. The default settings would be roughly 16MB per core, and this might be multiplied by 2 in our current setup since we have independent client and server pools in the same JVM. I'm not certain how executors running in YARN report availableProcessors, but is it possible your machines have 32 or more cores? That could cause an extra allocation of around 1GB of direct buffers.
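A back-of-the-envelope version of that estimate; the 16MB-per-core figure and the factor of two for separate client/server pools are taken from the comment above and should be treated as rough assumptions rather than exact Netty accounting:
{code}
// Rough estimate of Netty direct-buffer usage per executor JVM.
val perCoreBufferMiB = 16   // approximate per-core allocation cited above
val pools = 2               // independent client and server pools in one JVM

def estimatedDirectMiB(cores: Int): Int = cores * perCoreBufferMiB * pools

println(estimatedDirectMiB(32))  // 1024 MiB, i.e. roughly the ~1GB mentioned above
{code}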
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201827#comment-14201827 ] Sean Owen commented on SPARK-4289: -- This is a Hadoop issue, right? I don't know if Spark can address this directly. I suppose you could work around it with :silent in the shell.

Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
Key: SPARK-4289
URL: https://issues.apache.org/jira/browse/SPARK-4289
Project: Spark
Issue Type: Bug
Reporter: Corey J. Nolet

This one is easy to reproduce. {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure what the solution would be off hand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .init(console:10) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
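A hedged sketch of the :silent workaround mentioned above; :silent toggles the REPL's result printing, so toString() is never invoked on the returned Job (toggle it back on afterwards):
{code}
scala> :silent
scala> val job = new org.apache.hadoop.mapreduce.Job(sc.hadoopConfiguration)
scala> :silent
{code}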
[jira] [Updated] (SPARK-4288) Add Sparse Autoencoder algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-4288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4288: - Description: Are you proposing an implementation? Is it related to the neural network JIRA? Target Version/s: (was: 1.3.0) Issue Type: Wish (was: Bug) Add Sparse Autoencoder algorithm to MLlib -- Key: SPARK-4288 URL: https://issues.apache.org/jira/browse/SPARK-4288 Project: Spark Issue Type: Wish Components: MLlib Reporter: Guoqiang Li Labels: features Are you proposing an implementation? Is it related to the neural network JIRA? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201832#comment-14201832 ] Aaron Davidson commented on SPARK-2468: --- [~lianhuiwang] I have created [#3155|https://github.com/apache/spark/pull/3155/files], which I will clean up and try to get in tomorrow, and which makes the preferDirectBufs config forcefully disable direct byte buffers in both the server and client pools. Additionally, I have added the conf spark.shuffle.io.maxUsableCores, which should allow you to inform the executor how many cores you're actually using, so it will avoid allocating enough memory for all the machine's cores. I hope that simply specifying maxUsableCores is sufficient to actually fix this issue for you, but the combination should give a higher chance of success.
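As a concrete illustration, a hedged sketch of the two settings discussed here; spark.shuffle.io.preferDirectBufs is an existing config, while spark.shuffle.io.maxUsableCores is the new one proposed in that PR, so treat it as tentative until the PR is merged:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")  // avoid direct buffers in the Netty pools
  .set("spark.shuffle.io.maxUsableCores", "5")        // proposed: tell the transfer service how many cores the executor really uses
{code}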
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201844#comment-14201844 ] zzc commented on SPARK-2468: By the way, my test code:
{code}
val mapR = textFile.map(line => {
  ..
  ((value(1) + "_" + date.toString(), url), (flow, 1))
}).reduceByKey((pair1, pair2) => {
  (pair1._1 + pair2._1, pair1._2 + pair2._2)
}, 100)
mapR.persist(StorageLevel.MEMORY_AND_DISK_SER)

val mapR1 = mapR.groupBy(_._1._1)
  .mapValues(pairs => { pairs.toList.sortBy(_._2._1).reverse })
  .flatMap(values => { values._2 })
  .map(values => { values._1._1 + "\t" + values._1._2 + "\t" + values._2._1.toString() + "\t" + values._2._2.toString() })
  .saveAsTextFile(outputPath + "_1/")

val mapR2 = mapR.groupBy(_._1._1)
  .mapValues(pairs => { pairs.toList.sortBy(_._2._2).reverse })
  .flatMap(values => { values._2 })
  .map(values => { values._1._1 + "\t" + values._1._2 + "\t" + values._2._1.toString() + "\t" + values._2._2.toString() })
  .saveAsTextFile(outputPath + "_2/")
{code}
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201849#comment-14201849 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Thank you for the writeup. Is it really the case that each of your executors is only using 1 core for its 20GB of RAM? It seems like 5 would be in line with the portion of memory you're using. Also, the sum of your storage and shuffle memory fractions exceeds 1, so if you're caching any data and then performing a reduction/groupBy, you could actually see an OOM even without this other issue. I would recommend keeping the shuffle fraction relatively low unless you have a good reason not to, as raising it can lead to increased instability. The numbers are relatively close to my expectations, which would estimate netty allocating around 750MB of direct buffer space, thinking that it has 24 cores. With #3155 and maxUsableCores set to 1 (or 5), I hope this issue may be resolved.
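For context, a hedged sketch of the two memory fractions being discussed; the keys are the Spark 1.x config names mentioned later in this thread, and the 0.6/0.5 values are illustrative of a sum that exceeds 1, not the reporter's exact settings:
{code}
import org.apache.spark.SparkConf

// Illustrative only: 0.6 + 0.5 > 1.0, so cached blocks plus shuffle aggregation
// buffers can together claim more than the whole heap, which is the instability
// warned about above.
val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.shuffle.memoryFraction", "0.5")
{code}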
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201851#comment-14201851 ] jiezhou commented on SPARK-4275: I ran ./sbt/sbt assembly on my Mac and the error message is as follows:
usage: dirname path
./sbt/sbt: line 31: /sbt-launch-lib.bash: No such file or directory
./sbt/sbt: line 111: run: command not found
Obviously the space in the path breaks the dirname invocation.

./sbt/sbt assembly command fails if path has space in the name
Key: SPARK-4275
URL: https://issues.apache.org/jira/browse/SPARK-4275
Project: Spark
Issue Type: Bug
Components: Build
Affects Versions: 1.1.0
Reporter: Ravi Kiran
Priority: Trivial

I have downloaded branch-1.1 for building Spark from scratch on my Mac. The path had a space in it, like /Users/rkgurram/VirtualBox VMs/SPARK/spark-branch-1.1. 1) I cd to the above directory 2) Ran ./sbt/sbt assembly The command fails with weird messages.
[jira] [Commented] (SPARK-4283) Spark source code does not correctly import into eclipse
[ https://issues.apache.org/jira/browse/SPARK-4283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201853#comment-14201853 ] Sean Owen commented on SPARK-4283: -- This is really an Eclipse problem. I don't personally think it's worth the extra weight in the build for this. (Use pull requests, not patches on JIRAs, in Spark.)

Spark source code does not correctly import into Eclipse
Key: SPARK-4283
URL: https://issues.apache.org/jira/browse/SPARK-4283
Project: Spark
Issue Type: Bug
Components: Build
Reporter: Yang Yang
Priority: Minor
Attachments: spark_eclipse.diff

When I import the Spark source into Eclipse, either by running mvn eclipse:eclipse and then importing existing general projects, or by importing existing Maven projects, it does not recognize the project as a Scala project. I am adding a new plugin so that the import works.
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201855#comment-14201855 ] Apache Spark commented on SPARK-4275: - User 'shuhuai007' has created a pull request for this issue: https://github.com/apache/spark/pull/3156
[jira] [Resolved] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4275. -- Resolution: Duplicate You should report issues against head in general, rather than an older branch. This was already fixed in https://issues.apache.org/jira/browse/SPARK-3337
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201865#comment-14201865 ] zzc commented on SPARK-2468: Hi Aaron Davidson, what do you mean by "Is it really the case that each of your executors is only using 1 core for its 20GB of RAM? It seems like 5 would be in line with the portion of memory you're using"? I tried setting spark.storage.memoryFraction and spark.shuffle.memoryFraction from 0.2 to 0.5 before, and the OOM still occurred.
[jira] [Commented] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201876#comment-14201876 ] Ravi Kiran commented on SPARK-4275: --- Scott, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi
[jira] [Comment Edited] (SPARK-4275) ./sbt/sbt assembly command fails if path has space in the name
[ https://issues.apache.org/jira/browse/SPARK-4275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201876#comment-14201876 ] Ravi Kiran edited comment on SPARK-4275 at 11/7/14 10:13 AM: - Sean, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi was (Author: rkgurram): Scott, Thank you, will follow the advise, I am new to the Spark ecosystem and just getting my feet wet. Regards -Ravi
[jira] [Created] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
Shixiong Zhu created SPARK-4296:
---
Summary: Throw Expression not in GROUP BY when using same expression in group by clause and select clause
Key: SPARK-4296
URL: https://issues.apache.org/jira/browse/SPARK-4296
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.1.0
Reporter: Shixiong Zhu

When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw "Expression not in GROUP BY".
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)

val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")

val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare an Alias expression and a non-Alias expression.
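One possible way to sidestep the alias-equality issue while it is open might be to compute the expression once in a subquery and group by the resulting plain attribute; this is an untested sketch against the same reproduction above, not a confirmed fix:
{code:java}
// Untested workaround sketch: move upper(birthday.date) into a subquery so the
// outer GROUP BY references a plain attribute rather than an aliased expression.
val year = sqlContext.sql(
  "select count(*), d from (select upper(birthday.date) as d from people) t group by d")
year.collect
{code}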
[jira] [Comment Edited] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14201934#comment-14201934 ] Shixiong Zhu edited comment on SPARK-4296 at 11/7/14 11:21 AM: --- Stack trace: {code:java} org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#11.date AS date#17) AS c1#13, tree: Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13] Subquery people LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) {code} was (Author: zsxwing): Stack trace: {code:java} Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13] Subquery people LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36 at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) {code} Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201934#comment-14201934 ] Shixiong Zhu commented on SPARK-4296: - Stack trace:
{code:java}
Aggregate [Upper(birthday#11.date)], [COUNT(1) AS c0#12L,Upper(birthday#11.date AS date#17) AS c1#13]
 Subquery people
  LogicalRDD [name#10,birthday#11], MapPartitionsRDD[5] at mapPartitions at ExistingRDD.scala:36
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:133)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3$$anonfun$applyOrElse$7.apply(Analyzer.scala:130)
        at scala.collection.immutable.List.foreach(List.scala:318)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:130)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$$anonfun$apply$3.applyOrElse(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
        at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:115)
        at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckAggregation$.apply(Analyzer.scala:113)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
        at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
        at scala.collection.immutable.List.foreach(List.scala:318)
{code}
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201937#comment-14201937 ] Shixiong Zhu commented on SPARK-4296: - Originally reported by Tridib Samanta at http://apache-spark-user-list.1001560.n3.nabble.com/sql-group-by-on-UDF-not-working-td18339.html
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201959#comment-14201959 ] Tsuyoshi OZAWA commented on SPARK-4267: --- [~sandyr] [~pwendell] do you have any workarounds to deal with this problem?

Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
Key: SPARK-4267
URL: https://issues.apache.org/jira/browse/SPARK-4267
Project: Spark
Issue Type: Bug
Reporter: Tsuyoshi OZAWA

Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0, so I compiled with protobuf 2.5.0 like this:
{code}
./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0
{code}
Then Spark on YARN fails to launch jobs with an NPE.
{code}
$ bin/spark-shell --master yarn-client
scala> sc.textFile("hdfs:///user/ozawa/wordcountInput20G").flatMap(line => line.split(" ")).map(word => (word, 1)).persist().reduceByKey((a, b) => a + b, 16).saveAsTextFile("hdfs:///user/ozawa/sparkWordcountOutNew2");
java.lang.NullPointerException
        at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284)
        at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291)
        at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480)
        at $iwC$$iwC$$iwC$$iwC.init(console:13)
        at $iwC$$iwC$$iwC.init(console:18)
        at $iwC$$iwC.init(console:20)
        at $iwC.init(console:22)
        at init(console:24)
        at .init(console:28)
        at .clinit(console)
        at .init(console:7)
        at .clinit(console)
        at $print(console)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
        at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
        at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
        at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
{code}
[jira] [Commented] (SPARK-2429) Hierarchical Implementation of KMeans
[ https://issues.apache.org/jira/browse/SPARK-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201990#comment-14201990 ] Yu Ishikawa commented on SPARK-2429: Hi [~rnowling], I have a suggestion about a new function. I think it is difficult for this algorithm to have an advantage in computational complexity, so I implemented a function that cuts the resulting cluster tree by height. This function restructures a cluster tree without changing the original tree. We can control the number of clusters in a cluster tree by height, without recomputation. This is an advantage over KMeans and other clustering algorithms. You can see test code at the URL below. [https://github.com/yu-iskw/spark/blob/8355f959f02ca67454c9cb070912480db0a44671/mllib/src/test/scala/org/apache/spark/mllib/clustering/HierarchicalClusteringModelSuite.scala#L116]

Hierarchical Implementation of KMeans
Key: SPARK-2429
URL: https://issues.apache.org/jira/browse/SPARK-2429
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: RJ Nowling
Assignee: Yu Ishikawa
Priority: Minor
Attachments: 2014-10-20_divisive-hierarchical-clustering.pdf, The Result of Benchmarking a Hierarchical Clustering.pdf, benchmark-result.2014-10-29.html, benchmark2.html

Hierarchical clustering algorithms are widely used and would make a nice addition to MLlib. Clustering algorithms are useful for determining relationships between clusters as well as offering faster assignment. Discussion on the dev list suggested the following possible approaches:
* Top down, recursive application of KMeans
* Reuse DecisionTree implementation with different objective function
* Hierarchical SVD
It was also suggested that support for distance metrics other than Euclidean, such as negative dot or cosine, is necessary.
[jira] [Created] (SPARK-4297) Build warning fixes omnibus
Sean Owen created SPARK-4297: Summary: Build warning fixes omnibus Key: SPARK-4297 URL: https://issues.apache.org/jira/browse/SPARK-4297 Project: Spark Issue Type: Improvement Components: Build, Java API Affects Versions: 1.1.0 Reporter: Sean Owen Priority: Minor There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4297) Build warning fixes omnibus
[ https://issues.apache.org/jira/browse/SPARK-4297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202030#comment-14202030 ] Apache Spark commented on SPARK-4297: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/3157
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14202056#comment-14202056 ] Lianhui Wang commented on SPARK-2468: - [~adav] Yes, with https://github.com/apache/spark/pull/3155/ it does not happen in my test, but I found that Netty's performance is not as good as NioBlockTransferService, so I need to find out why Netty performs worse than NioBlockTransferService in my test. Can you give me some suggestions? Thanks. And how about your test? [~zzcclp]
[jira] [Comment Edited] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202056#comment-14202056 ] Lianhui Wang edited comment on SPARK-2468 at 11/7/14 2:01 PM: -- [~adav] yes,with https://github.com/apache/spark/pull/3155/ in my test beyond physical memory limits does not happened.but i discover that Netty's performance is not good than NioBlockTransferService. so I need to find why Netty's performance is bad than NioBlockTransferService in my test.Can you give me some suggestions? thanks.and how about your test? [~zzcclp] was (Author: lianhuiwang): [~adav] yes,with https://github.com/apache/spark/pull/3155/ in my test it does not happened.but i discover that Netty's performance is not good than NioBlockTransferService. so I need to find why Netty's performance is bad than NioBlockTransferService in my test.Can you give me some suggestions? thanks.and how about your test? [~zzcclp] Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4298) The spark-submit cannot read Main-Class from Manifest.
Milan Straka created SPARK-4298: --- Summary: The spark-submit cannot read Main-Class from Manifest. Key: SPARK-4298 URL: https://issues.apache.org/jira/browse/SPARK-4298 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Environment: Linux spark-1.1.0-bin-hadoop2.4.tgz java version 1.7.0_72 Java(TM) SE Runtime Environment (build 1.7.0_72-b14) Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode) Reporter: Milan Straka Consider trivial {{test.scala}}: {code:title=test.scala|borderStyle=solid} import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object Main { def main(args: Array[String]) { val sc = new SparkContext() sc.stop() } } {code} When built with {{sbt}} and executed using {{spark-submit target/scala-2.10/test_2.10-1.0.jar}}, I get the following error: {code} Spark assembly has been built with Hive, including Datanucleus jars on classpath Error: Cannot load main class from JAR: file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar Run with --help for usage help or --verbose for debug output {code} When executed using {{spark-submit --class Main target/scala-2.10/test_2.10-1.0.jar}}, it works. The jar file has correct MANIFEST.MF: {code:title=MANIFEST.MF|borderStyle=solid} Manifest-Version: 1.0 Implementation-Vendor: test Implementation-Title: test Implementation-Version: 1.0 Implementation-Vendor-Id: test Specification-Vendor: test Specification-Title: test Specification-Version: 1.0 Main-Class: Main {code} The problem is that in {{org.apache.spark.deploy.SparkSubmitArguments}}, line 127: {code} val jar = new JarFile(primaryResource) {code} the primaryResource has String value {{file:/ha/home/straka/s/target/scala-2.10/test_2.10-1.0.jar}}, which is URI, but JarFile can use only Path. One way to fix this would be using {code} val uri = new URI(primaryResource) val jar = new JarFile(uri.getPath) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
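A minimal, self-contained sketch of the fix proposed above (the object name and the jar path are made up for illustration): java.util.jar.JarFile expects a filesystem path, so a "file:/..." string has to be converted through java.net.URI first.
{code}
import java.net.URI
import java.util.jar.JarFile

object MainClassLookup {
  def mainClassOf(primaryResource: String): Option[String] = {
    // JarFile("file:/path/to.jar") fails; strip the URI scheme first.
    val path =
      if (primaryResource.startsWith("file:")) new URI(primaryResource).getPath
      else primaryResource
    val jar = new JarFile(path)
    try {
      Option(jar.getManifest).flatMap(m => Option(m.getMainAttributes.getValue("Main-Class")))
    } finally {
      jar.close()
    }
  }

  def main(args: Array[String]): Unit =
    // Assumed example jar; with the manifest shown above this prints Some(Main).
    println(mainClassOf("file:/tmp/test_2.10-1.0.jar"))
}
{code}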
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202136#comment-14202136 ] zzc commented on SPARK-2468: The performance of Netty is worse than NIO in my test. Why? @Aaron Davidson. I want to improve the performance of shuffle; with 500G of shuffle data, it is much worse than Hadoop. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4205) Timestamp and Date objects with comparison operators
[ https://issues.apache.org/jira/browse/SPARK-4205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202251#comment-14202251 ] Apache Spark commented on SPARK-4205: - User 'culler' has created a pull request for this issue: https://github.com/apache/spark/pull/3158 Timestamp and Date objects with comparison operators Key: SPARK-4205 URL: https://issues.apache.org/jira/browse/SPARK-4205 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Marc Culler Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4300) Race condition during SparkWorker shutdown
Alex Liu created SPARK-4300: --- Summary: Race condition during SparkWorker shutdown Key: SPARK-4300 URL: https://issues.apache.org/jira/browse/SPARK-4300 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.1.0 Reporter: Alex Liu Priority: Minor When a shark job is done. there are some error message as following show in the log {code} INFO 22:10:41,635 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,640 SparkMaster: Removing app app-20141106221014- INFO 22:10:41,687 SparkMaster: Removing application Shark::ip-172-31-11-204.us-west-1.compute.internal INFO 22:10:41,710 SparkWorker: Asked to kill executor app-20141106221014-/0 INFO 22:10:41,712 SparkWorker: Runner thread for executor app-20141106221014-/0 interrupted INFO 22:10:41,714 SparkWorker: Killing process! ERROR 22:10:41,738 SparkWorker: Error writing stream to file /var/lib/spark/work/app-20141106221014-/0/stdout ERROR 22:10:41,739 SparkWorker: java.io.IOException: Stream closed ERROR 22:10:41,739 SparkWorker: at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) ERROR 22:10:41,740 SparkWorker: at java.io.BufferedInputStream.read1(BufferedInputStream.java:272) ERROR 22:10:41,740 SparkWorker: at java.io.BufferedInputStream.read(BufferedInputStream.java:334) ERROR 22:10:41,740 SparkWorker: at java.io.FilterInputStream.read(FilterInputStream.java:107) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39) ERROR 22:10:41,741 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) ERROR 22:10:41,742 SparkWorker: at org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38) INFO 22:10:41,838 SparkMaster: Connected to Cassandra cluster: 4299 INFO 22:10:41,839 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,840 SparkMaster: New Cassandra host /172.31.11.204:9042 added INFO 22:10:41,841 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,842 SparkMaster: Adding host 172.31.11.204 (Analytics) INFO 22:10:41,852 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,853 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,853 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,857 SparkMaster: akka.tcp://sparkdri...@ip-172-31-11-204.us-west-1.compute.internal:57641 got disassociated, removing it. INFO 22:10:41,862 SparkMaster: Adding host 172.31.11.204 (Analytics) WARN 22:10:42,200 SparkMaster: Got status update for unknown executor app-20141106221014-/0 INFO 22:10:42,211 SparkWorker: Executor app-20141106221014-/0 finished with state KILLED exitStatus 143 {code} /var/lib/spark/work/app-20141106221014-/0/stdout is on the disk. It is trying to write to a close IO stream. 
The Spark worker shuts the executor down with:
{code}
private def killProcess(message: Option[String]) {
  var exitCode: Option[Int] = None
  logInfo("Killing process!")
  process.destroy()
  process.waitFor()
  if (stdoutAppender != null) {
    stdoutAppender.stop()
  }
  if (stderrAppender != null) {
    stderrAppender.stop()
  }
  if (process != null) {
    exitCode = Some(process.waitFor())
  }
  worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
}
{code}
But stdoutAppender is still writing to the output log file concurrently, which creates the race condition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
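One way to avoid the race, sketched below (illustrative only, not the patch that was actually merged): let the child process exit before stopping the appenders, so the FileAppender threads are not still draining stdout/stderr when their streams are closed.
{code}
private def killProcess(message: Option[String]) {
  var exitCode: Option[Int] = None
  if (process != null) {
    logInfo("Killing process!")
    process.destroy()
    exitCode = Some(process.waitFor()) // wait for the process to die first
  }
  // Stop the appenders only after the process has exited and its
  // stdout/stderr streams have been fully drained.
  if (stdoutAppender != null) { stdoutAppender.stop() }
  if (stderrAppender != null) { stderrAppender.stop() }
  worker ! ExecutorStateChanged(appId, execId, state, message, exitCode)
}
{code}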
[jira] [Commented] (SPARK-4267) Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later
[ https://issues.apache.org/jira/browse/SPARK-4267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202319#comment-14202319 ] Sandy Ryza commented on SPARK-4267: --- Strange. Checked in the code and it seems like this must mean the taskScheduler is null. Did you see any errors farther up in the shell before this happened? Does it work in local mode? Failing to launch jobs on Spark on YARN with Hadoop 2.5.0 or later -- Key: SPARK-4267 URL: https://issues.apache.org/jira/browse/SPARK-4267 Project: Spark Issue Type: Bug Reporter: Tsuyoshi OZAWA Currently we're trying Spark on YARN included in Hadoop 2.5.1. Hadoop 2.5 uses protobuf 2.5.0 so I compiled with protobuf 2.5.1 like this: {code} ./make-distribution.sh --name spark-1.1.1 --tgz -Pyarn -Dhadoop.version=2.5.1 -Dprotobuf.version=2.5.0 {code} Then Spark on YARN fails to launch jobs with NPE. {code} $ bin/spark-shell --master yarn-client scala sc.textFile(hdfs:///user/ozawa/wordcountInput20G).flatMap(line = line.split( )).map(word = (word, 1)).persist().reduceByKey((a, b) = a + b, 16).saveAsTextFile(hdfs:///user/ozawa/sparkWordcountOutNew2); java.lang.NullPointerException at org.apache.spark.SparkContext.defaultParallelism(SparkContext.scala:1284) at org.apache.spark.SparkContext.defaultMinPartitions(SparkContext.scala:1291) at org.apache.spark.SparkContext.textFile$default$2(SparkContext.scala:480) at $iwC$$iwC$$iwC$$iwC.init(console:13) at $iwC$$iwC$$iwC.init(console:18) at $iwC$$iwC.init(console:20) at $iwC.init(console:22) at init(console:24) at .init(console:28) at .clinit(console) at .init(console:7) at .clinit(console) at $print(console) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)
[jira] [Commented] (SPARK-4296) Throw Expression not in GROUP BY when using same expression in group by clause and select clause
[ https://issues.apache.org/jira/browse/SPARK-4296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202326#comment-14202326 ] Tridib Samanta commented on SPARK-4296: --- I wish we could use the alias of a calculated column in the group by clause, which would avoid having to repeat long calculated expressions. Throw Expression not in GROUP BY when using same expression in group by clause and select clause --- Key: SPARK-4296 URL: https://issues.apache.org/jira/browse/SPARK-4296 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Shixiong Zhu When the input data has a complex structure, using the same expression in the group by clause and the select clause will throw Expression not in GROUP BY.
{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
case class Birthday(date: String)
case class Person(name: String, birthday: Birthday)
val people = sc.parallelize(List(Person("John", Birthday("1990-01-22")), Person("Jim", Birthday("1980-02-28"))))
people.registerTempTable("people")
val year = sqlContext.sql("select count(*), upper(birthday.date) from people group by upper(birthday.date)")
year.collect
{code}
Here is the plan of year:
{code:java}
SchemaRDD[3] at RDD at SchemaRDD.scala:105
== Query Plan ==
== Physical Plan ==
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Expression not in GROUP BY: Upper(birthday#1.date AS date#9) AS c1#3, tree:
Aggregate [Upper(birthday#1.date)], [COUNT(1) AS c0#2L,Upper(birthday#1.date AS date#9) AS c1#3]
 Subquery people
  LogicalRDD [name#0,birthday#1], MapPartitionsRDD[1] at mapPartitions at ExistingRDD.scala:36
{code}
The bug is the equality test for `Upper(birthday#1.date)` and `Upper(birthday#1.date AS date#9)`. Maybe Spark SQL needs a mechanism to compare Alias expressions and non-Alias expressions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
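To make that last point concrete, here is a schematic, self-contained sketch (toy expression classes, not Catalyst's real ones) of the alias-insensitive comparison the description asks for: strip Alias wrappers from both expressions before testing equality, so the grouping expression and its aliased copy in the select list match.
{code}
sealed trait Expr
case class Attr(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class Upper(child: Expr) extends Expr

object AliasInsensitive {
  def stripAliases(e: Expr): Expr = e match {
    case Alias(child, _) => stripAliases(child)
    case Upper(child)    => Upper(stripAliases(child))
    case other           => other
  }

  def semanticallyEqual(a: Expr, b: Expr): Boolean =
    stripAliases(a) == stripAliases(b)

  def main(args: Array[String]): Unit = {
    val groupExpr  = Upper(Attr("birthday.date"))
    val selectExpr = Upper(Alias(Attr("birthday.date"), "date#9"))
    println(semanticallyEqual(groupExpr, selectExpr)) // true
  }
}
{code}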
[jira] [Commented] (SPARK-4280) In dynamic allocation, add option to never kill executors with cached blocks
[ https://issues.apache.org/jira/browse/SPARK-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202382#comment-14202382 ] Sandy Ryza commented on SPARK-4280: --- So it looks like the block IDs of broadcast variables on each node are the same broadcast IDs used on the driver. Which means it wouldn't be too hard to do this filtering. Even without it, this would still be useful. What do you think? In dynamic allocation, add option to never kill executors with cached blocks Key: SPARK-4280 URL: https://issues.apache.org/jira/browse/SPARK-4280 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Sandy Ryza Even with the external shuffle service, this is useful in situations like Hive on Spark where a query might require caching some data. We want to be able to give back executors after the job ends, but not during the job if it would delete intermediate results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202420#comment-14202420 ] Josh Rosen commented on SPARK-4216: --- A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202420#comment-14202420 ] Josh Rosen edited comment on SPARK-4216 at 11/7/14 6:38 PM: A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails setting. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. was (Author: joshrosen): A large part of the problem is that the Jenkins GHPRB plugin has a lot of settings that are global rather than per-project. In this case, I think the duplicate postings are being generated by the fall back on posting comments in case the GitHub commit status API call fails. We can't use the status API in Spark, but I guess the other AMP Lab projects used to use it and didn't require this fallback. At some point, I think we switched the comment fallback on because some other project needed it, leading to these duplicate updates. As I've commented elsewhere, one solution would be to simply not use the GHPRB plugin for Spark and instead use a parameterized build that's triggered remotely (e.g. through spark-prs.appspot.com). I think that we could easily build this layer on top of spark-prs; it's just a matter of finding the time to do it (and to add the necessary features, like automatic detection of when new commits have been pushed, listening to commands addressed to Jenkins, etc.) I already have the triggering working manually (this runs NewSparkPullRequestBuilder), so the only remaining piece is the automatic triggering / ACLs. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
Josh Rosen created SPARK-4301: - Summary: StreamingContext should not allow start() to be called after calling stop() Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0, 1.0.2, 1.0.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
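A minimal sketch of the guard being proposed (a simplified lifecycle, not the actual StreamingContext code): track whether stop() has been called and make a subsequent start() fail fast instead of silently proceeding.
{code}
object ContextState extends Enumeration {
  val Initialized, Started, Stopped = Value
}

class GuardedContext {
  private var state = ContextState.Initialized

  def start(): Unit = synchronized {
    state match {
      case ContextState.Initialized => state = ContextState.Started
      case ContextState.Started     => throw new IllegalStateException("Context already started")
      case ContextState.Stopped     => throw new IllegalStateException("Context cannot be started after stop()")
    }
  }

  def stop(): Unit = synchronized {
    // Stopping an un-started context stays allowed so that underlying
    // resources (e.g. a SparkContext it constructed) can still be released.
    state = ContextState.Stopped
  }
}
{code}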
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202490#comment-14202490 ] Aaron Davidson commented on SPARK-2468: --- [~lianhuiwang] Can you try again with preferDirectBufs set to true, and just setting maxUsableCores down to the number of cores each container actually has? It's possible the performance discrepancy you're seeing is simply due to heap byte buffers not being as fast as direct ones. You might also decrease the Java heap size a bit while keeping the container size the same, if _any_ direct memory allocation is causing the container to be killed. [~zzcclp] Same suggestion for you about setting preferDirectBufs to true and setting maxUsableCores down, but I will also perform another round of benchmarking -- it's possible we accidentally introduced a performance regression in the last few patches. Comparing Hadoop vs Spark performance is a different matter. A few suggestions on your setup: You should set executor-cores to 5, so that each executor is actually using 5 cores instead of just 1. You're losing significant parallelism because of this setting, as Spark will only launch 1 task per core on an executor at any given time. Second, groupBy() is inefficient (it's doc was changed recently to reflect this), and should be avoided. I would recommend changing your job to sort the whole RDD using something similar to {code}mapR.map { x = ((x._1._1, x._2._1), x) }.sortByKey(){code}, which would not require that all values for a single group fit in memory. This would still effectively group by x._1._1, but would sort within each group by x._2._1, and would utilize Spark's efficient sorting machinery. Netty-based block server / client module Key: SPARK-2468 URL: https://issues.apache.org/jira/browse/SPARK-2468 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.2.0 Right now shuffle send goes through the block manager. This is inefficient because it requires loading a block from disk into a kernel buffer, then into a user space buffer, and then back to a kernel send buffer before it reaches the NIC. It does multiple copies of the data and context switching between kernel/user. It also creates unnecessary buffer in the JVM that increases GC Instead, we should use FileChannel.transferTo, which handles this in the kernel space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/ One potential solution is to use Netty. Spark already has a Netty based network module implemented (org.apache.spark.network.netty). However, it lacks some functionality and is turned off by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
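Spelled out, the sort-based grouping suggested above looks roughly like the following (the ">" in "x =>" was lost to JIRA formatting; the sample data is made up, and the real mapR comes from the user's job). Sorting on the composite (group, value) key keeps each group's records together and ordered, without requiring the whole group to fit in memory the way groupBy() does.
{code}
// In the spark-shell; the pair-RDD implicits are needed for sortByKey on 1.x.
import org.apache.spark.SparkContext._

val mapR = sc.parallelize(Seq(
  (("a", 1), (3.0, "x")),
  (("b", 2), (1.0, "y")),
  (("a", 3), (2.0, "z"))
))

// Composite key: group by x._1._1, order within the group by x._2._1.
val sorted = mapR.map { x => ((x._1._1, x._2._1), x) }.sortByKey()
sorted.collect().foreach(println)
{code}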
[jira] [Commented] (SPARK-4301) StreamingContext should not allow start() to be called after calling stop()
[ https://issues.apache.org/jira/browse/SPARK-4301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202503#comment-14202503 ] Apache Spark commented on SPARK-4301: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/3160 StreamingContext should not allow start() to be called after calling stop() --- Key: SPARK-4301 URL: https://issues.apache.org/jira/browse/SPARK-4301 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.0.2, 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen In Spark 1.0.0+, calling {{stop()}} on a StreamingContext that has not been started is a no-op which has no side-effects. This allows users to call {{stop()}} on a fresh StreamingContext followed by {{start()}}. I believe that this almost always indicates an error and is not behavior that we should support. Since we don't allow {{start() stop() start()}} then I don't think it makes sense to allow {{stop() start()}}. The current behavior can lead to resource leaks when StreamingContext constructs its own SparkContext: if I call {{stop(stopSparkContext=True)}}, then I expect StreamingContext's underlying SparkContext to be stopped irrespective of whether the StreamingContext has been started. This is useful when writing unit test fixtures. Prior discussions: - https://github.com/apache/spark/pull/3053#discussion-diff-19710333R490 - https://github.com/apache/spark/pull/3121#issuecomment-61927353 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3337) Paranoid quoting in shell to allow install dirs with spaces within.
[ https://issues.apache.org/jira/browse/SPARK-3337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3337: -- Fix Version/s: (was: 1.1.1) 1.2.0 Looks like the Fix versions are wrong here, since this patch only made it into master / 1.2.0, so I'm removing 1.1.1 as a Fix version and adding 1.2.0. Paranoid quoting in shell to allow install dirs with spaces within. --- Key: SPARK-3337 URL: https://issues.apache.org/jira/browse/SPARK-3337 Project: Spark Issue Type: Improvement Components: Build, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Fix For: 1.2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202518#comment-14202518 ] shane knapp commented on SPARK-4216: yep, not running ghprb for spark is a totally legit option as well (which i'd forgotten about -- this was something we'd spoken about josh). just be aware that you're adding a new layer of tooling, which is fine, but it will need to be documented, reviewed and support. i can help support amplab-based stuff (ie: things on our end), but once we're adding in things like remote triggers from appspot, i'll need to draw a support line. :) @nicholas -- those example you showed me are from when the amplab jenkins bot was broken, and not posting. btw, i turned down the number of amplab jenkins bot posts a while back to a minimum, so as not to spam spark builds. so, we: 1) we carry on w/the duplicate postings (annoying, but not dangerous) 2) spark starts using it's own bot/trigger system (needs a lot of work) (1) for now, (2) when you guys can find some time to make it happen? Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4225) jdbc/odbc error when using maven build spark
[ https://issues.apache.org/jira/browse/SPARK-4225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4225. - Resolution: Fixed Issue resolved by pull request 3105 [https://github.com/apache/spark/pull/3105] jdbc/odbc error when using maven build spark Key: SPARK-4225 URL: https://issues.apache.org/jira/browse/SPARK-4225 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Priority: Blocker Fix For: 1.2.0 use command as follows to build spark mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.1 -Phive -DskipTests clean package then use beeline to connect to thrift server ,get this error: 14/11/04 11:30:31 INFO ObjectStore: Initialized ObjectStore 14/11/04 11:30:31 INFO AbstractService: Service:ThriftBinaryCLIService is started. 14/11/04 11:30:31 INFO AbstractService: Service:HiveServer2 is started. 14/11/04 11:30:31 INFO HiveThriftServer2: HiveThriftServer2 started 14/11/04 11:30:31 INFO ThriftCLIService: ThriftBinaryCLIService listening on 0.0.0.0/0.0.0.0:1 14/11/04 11:33:26 INFO ThriftCLIService: Client protocol version: HIVE_CLI_SERVICE_PROTOCOL_V6 14/11/04 11:33:26 INFO HiveMetaStore: No user is added in admin role, since config is empty 14/11/04 11:33:26 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 14/11/04 11:33:26 INFO SessionState: No Tez session required at this point. hive.execution.engine=mr. 14/11/04 11:33:26 ERROR TThreadPoolServer: Thrift error occurred during processing of message. org.apache.thrift.protocol.TProtocolException: Cannot write a TUnion with no set value! at org.apache.thrift.TUnion$TUnionStandardScheme.write(TUnion.java:240) at org.apache.thrift.TUnion$TUnionStandardScheme.write(TUnion.java:213) at org.apache.thrift.TUnion.write(TUnion.java:152) at org.apache.hive.service.cli.thrift.TGetInfoResp$TGetInfoRespStandardScheme.write(TGetInfoResp.java:456) at org.apache.hive.service.cli.thrift.TGetInfoResp$TGetInfoRespStandardScheme.write(TGetInfoResp.java:406) at org.apache.hive.service.cli.thrift.TGetInfoResp.write(TGetInfoResp.java:341) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result$GetInfo_resultStandardScheme.write(TCLIService.java:3754) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result$GetInfo_resultStandardScheme.write(TCLIService.java:3718) at org.apache.hive.service.cli.thrift.TCLIService$GetInfo_result.write(TCLIService.java:3669) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:53) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:55) at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:206) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202548#comment-14202548 ] Norman He commented on SPARK-2447: -- Hi Ted, I have already made some changes in scala for facading and added some tests. Let us discuss early next week. How should I send you the code reviews? -Norman Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab
[ https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202553#comment-14202553 ] Nicholas Chammas commented on SPARK-4216: - {quote} 1) we carry on w/the duplicate postings (annoying, but not dangerous) 2) spark starts using it's own bot/trigger system (needs a lot of work) (1) for now, (2) when you guys can find some time to make it happen? {quote} Seems sensible to me. Long term, (2) seems like the right thing to do for Spark if we're gonna stay on the AMPLab cluster. Eliminate duplicate Jenkins GitHub posts from AMPLab Key: SPARK-4216 URL: https://issues.apache.org/jira/browse/SPARK-4216 Project: Spark Issue Type: Bug Components: Build, Project Infra Reporter: Nicholas Chammas Priority: Minor * [Real Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873361] * [Imposter Jenkins | https://github.com/apache/spark/pull/2988#issuecomment-60873366] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4213) SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators
[ https://issues.apache.org/jira/browse/SPARK-4213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4213. - Resolution: Fixed Issue resolved by pull request 3083 [https://github.com/apache/spark/pull/3083] SparkSQL - ParquetFilters - No support for LT, LTE, GT, GTE operators - Key: SPARK-4213 URL: https://issues.apache.org/jira/browse/SPARK-4213 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Environment: CDH5.2, Hive 0.13.1, Spark 1.2 snapshot (commit hash 76386e1a23c) Reporter: Terry Siu Priority: Blocker Fix For: 1.2.0 When I issue a hql query against a HiveContext where my predicate uses a column of string type with one of LT, LTE, GT, or GTE operator, I get the following error: scala.MatchError: StringType (of class org.apache.spark.sql.catalyst.types.StringType$) Looking at the code in org.apache.spark.sql.parquet.ParquetFilters, StringType is absent from the corresponding functions for creating these filters. To reproduce, in a Hive 0.13.1 shell, I created the following table (at a specified DB): create table sparkbug ( id int, event string ) stored as parquet; Insert some sample data: insert into table sparkbug select 1, '2011-06-18' from some table limit 1; insert into table sparkbug select 2, '2012-01-01' from some table limit 1; Launch a spark shell and create a HiveContext to the metastore where the table above is located. import org.apache.spark.sql._ import org.apache.spark.sql.SQLContext import org.apache.spark.sql.hive.HiveContext val hc = new HiveContext(sc) hc.setConf(spark.sql.shuffle.partitions, 10) hc.setConf(spark.sql.hive.convertMetastoreParquet, true) hc.setConf(spark.sql.parquet.compression.codec, snappy) import hc._ hc.hql(select * from db.sparkbug where event = '2011-12-01') A scala.MatchError will appear in the output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
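Schematically (toy types below, not the real ParquetFilters code), the failure comes from a data-type pattern match that simply had no arm for string columns, so building a comparison filter for a StringType predicate fell through to scala.MatchError.
{code}
sealed trait ToyDataType
case object ToyIntegerType extends ToyDataType
case object ToyStringType  extends ToyDataType

object FilterSketch {
  // Before the fix, the equivalent of the ToyStringType case was missing,
  // so a predicate such as event >= '2011-12-01' on a string column ended
  // in a MatchError instead of a pushed-down filter.
  def greaterThanOrEqual(dataType: ToyDataType, column: String, value: Any): String =
    dataType match {
      case ToyIntegerType => s"int($column) >= ${value.asInstanceOf[Int]}"
      case ToyStringType  => s"binary($column) >= '${value.asInstanceOf[String]}'"
    }

  def main(args: Array[String]): Unit =
    println(greaterThanOrEqual(ToyStringType, "event", "2011-12-01"))
}
{code}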
[jira] [Resolved] (SPARK-4272) Add more unwrap functions for primitive type in TableReader
[ https://issues.apache.org/jira/browse/SPARK-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4272. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3136 [https://github.com/apache/spark/pull/3136] Add more unwrap functions for primitive type in TableReader --- Key: SPARK-4272 URL: https://issues.apache.org/jira/browse/SPARK-4272 Project: Spark Issue Type: Improvement Components: SQL Reporter: Cheng Hao Priority: Minor Fix For: 1.2.0 Currently, the data unwrap only support couple of primitive types, not all, it will not cause exception, but may get some performance in table scanning for the type like binary, date, timestamp, decimal etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4203) Partition directories in random order when inserting into hive table
[ https://issues.apache.org/jira/browse/SPARK-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4203. - Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3076 [https://github.com/apache/spark/pull/3076] Partition directories in random order when inserting into hive table Key: SPARK-4203 URL: https://issues.apache.org/jira/browse/SPARK-4203 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Taylor Fix For: 1.2.0 When doing an insert into a hive table with partitions, the folders written to the file system are in a random order instead of the order defined in table creation. It seems that the loadPartition method in Hive.java takes a Map<String,String> parameter but expects to be called with a map that has a defined ordering, such as LinkedHashMap. I have a patch which I will do a PR for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
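The ordering point is easy to see in isolation (a standalone illustration, unrelated to the Hive codepath itself): a plain java.util.HashMap forgets insertion order, while LinkedHashMap preserves it, which is what the partition-spec map needs for the directories to come out in declaration order.
{code}
import java.util.{HashMap, LinkedHashMap}

object PartitionOrderDemo {
  def main(args: Array[String]): Unit = {
    val plain  = new HashMap[String, String]()
    val linked = new LinkedHashMap[String, String]()
    for (k <- Seq("year", "month", "day", "hour")) {
      plain.put(k, "1")
      linked.put(k, "1")
    }
    println("HashMap keys:       " + plain.keySet())  // arbitrary iteration order
    println("LinkedHashMap keys: " + linked.keySet()) // year, month, day, hour
  }
}
{code}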
[jira] [Resolved] (SPARK-4292) incorrect result set in JDBC/ODBC
[ https://issues.apache.org/jira/browse/SPARK-4292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-4292. - Resolution: Fixed Issue resolved by pull request 3149 [https://github.com/apache/spark/pull/3149] incorrect result set in JDBC/ODBC - Key: SPARK-4292 URL: https://issues.apache.org/jira/browse/SPARK-4292 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Fix For: 1.2.0 select * from src, get result as follows: | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | | 97 | val_97 | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-4302) Make jsonRDD/jsonFile support more field data types
Yin Huai created SPARK-4302: --- Summary: Make jsonRDD/jsonFile support more field data types Key: SPARK-4302 URL: https://issues.apache.org/jira/browse/SPARK-4302 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Yin Huai Since we allow users to specify schemas, jsonRDD/jsonFile should support all Spark SQL data types in the provided schema. A related post in mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/jsonRdd-and-MapType-td18376.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
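For reference, this is the usage pattern the issue is about, written against the 1.1-era API (the sample JSON and table name are made up): the caller supplies a schema, here including a MapType field, and jsonRDD should honor every Spark SQL data type that schema can express.
{code}
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
val json = sc.parallelize(Seq("""{"name":"a","props":{"k1":"v1","k2":"v2"}}"""))
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("props", MapType(StringType, StringType), nullable = true)))
val docs = sqlContext.jsonRDD(json, schema)
docs.registerTempTable("docs")
docs.printSchema()
{code}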
[jira] [Created] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
Jia Xu created SPARK-4303: - Summary: [MLLIB] Use Long IDs instead of Int in ALS.Rating class Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used are usually Long type instead of Integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. (was: In many big data recommendation applications, the IDs used are usually Long type instead of Integer. ) [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) was: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jia Xu updated SPARK-4303: -- Description: In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) was:In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2381) streaming receiver crashed,but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202798#comment-14202798 ] Apache Spark commented on SPARK-2381: - User 'joyyoj' has created a pull request for this issue: https://github.com/apache/spark/pull/1693 streaming receiver crashed,but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job and the receivers don't start normally, the application should stop itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14202805#comment-14202805 ] Hossein Falaki commented on SPARK-2360: --- Sure. CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-2360) CSV import to SchemaRDDs
[ https://issues.apache.org/jira/browse/SPARK-2360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hossein Falaki closed SPARK-2360. - This will be a package using Data Source API CSV import to SchemaRDDs Key: SPARK-2360 URL: https://issues.apache.org/jira/browse/SPARK-2360 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Hossein Falaki I think the first step it to design the interface that we want to present to users. Mostly this is defining options when importing. Off the top of my head: - What is the separator? - Provide column names or infer them from the first row. - how to handle multiple files with possibly different schemas - do we have a method to let users specify the datatypes of the columns or are they just strings? - what types of quoting / escaping do we want to support? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2447: - Target Version/s: (was: 1.2.0) Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But first thoughts is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto flush off for higher through put. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support python? (python may be a different Jira it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (May be in a separate Jira need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Target Version/s: 1.2.0 Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Priority: Critical (was: Major) Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3754) Spark Streaming fileSystem API is not callable from Java
[ https://issues.apache.org/jira/browse/SPARK-3754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-3754: - Affects Version/s: 1.0.0 1.1.0 Spark Streaming fileSystem API is not callable from Java Key: SPARK-3754 URL: https://issues.apache.org/jira/browse/SPARK-3754 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0, 1.1.0 Reporter: holdenk Assignee: Holden Karau Priority: Critical The Spark Streaming Java API for fileSystem is not callable from Java. We should do something like with how it is handled in the Java Spark Context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4303. -- Resolution: Duplicate I think this is an exact duplicate of https://issues.apache.org/jira/browse/SPARK-2465. See the discussion about why this won't be committed in the foreseeable future. I agree there are some arguments for long but fair enough there are downsides too. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
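As a stopgap (an illustration, not something proposed in this thread), Long IDs can be used with the existing Int-based API by assigning each distinct ID a dense Int index and keeping the reverse mapping; this assumes the distinct ID sets are small enough to collect to the driver.
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val raw = sc.parallelize(Seq(
  (10000000001L, 20000000007L, 4.0),
  (10000000002L, 20000000007L, 3.0),
  (10000000001L, 20000000009L, 5.0)))

// Dense Int indices for the Long IDs (collected to the driver, so this only
// works when the number of distinct users/products is modest).
val userIndex    = raw.map(_._1).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()
val productIndex = raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt).collectAsMap()

val ratings = raw.map { case (u, p, r) => Rating(userIndex(u), productIndex(p), r) }
val model = ALS.train(ratings, 10, 10, 0.01)
{code}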
[jira] [Created] (SPARK-4304) sortByKey() will fail on empty RDD
Davies Liu created SPARK-4304: - Summary: sortByKey() will fail on empty RDD Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0, 1.0.2, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File stdin, line 1, in module File /Users/davies/work/spark/python/pyspark/rdd.py, line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203092#comment-14203092 ] Apache Spark commented on SPARK-4304: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3162 sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203102#comment-14203102 ] Apache Spark commented on SPARK-4304: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/3163 sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4303) [MLLIB] Use Long IDs instead of Int in ALS.Rating class
[ https://issues.apache.org/jira/browse/SPARK-4303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203147#comment-14203147 ] Matei Zaharia commented on SPARK-4303: -- Yup, this will actually become easier with the new pipeline API, but it's probably not going to happen in 1.2. [MLLIB] Use Long IDs instead of Int in ALS.Rating class --- Key: SPARK-4303 URL: https://issues.apache.org/jira/browse/SPARK-4303 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Jia Xu In many big data recommendation applications, the IDs used for users and products are usually Long type instead of Integer. So a Rating class based on Long IDs should be more useful for these applications. i.e. case class Rating(val user: Long, val product: Long, val rating: Double) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4289) Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance.
[ https://issues.apache.org/jira/browse/SPARK-4289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203183#comment-14203183 ] Corey J. Nolet commented on SPARK-4289: --- I suppose we could look at it as a Hadoop issue, though newing up a Job works fine without the Scala shell doing the toString(). I'd have to dive in deeper to find out why the states seem to be different between the constructor and the toString(), and even more importantly, why it cares... I think :silent will work for the short term. Creating an instance of Hadoop Job fails in the Spark shell when toString() is called on the instance. -- Key: SPARK-4289 URL: https://issues.apache.org/jira/browse/SPARK-4289 Project: Spark Issue Type: Bug Reporter: Corey J. Nolet This one is easy to reproduce. {code}val job = new Job(sc.hadoopConfiguration){code} I'm not sure what the solution would be offhand as it's happening when the shell is calling toString() on the instance of Job. The problem is, because of the failure, the instance is never actually assigned to the job val. java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283) at org.apache.hadoop.mapreduce.Job.toString(Job.java:452) at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324) at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329) at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337) at .<init>(<console>:10) at .<clinit>(<console>) at $print(<console>) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789) at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062) at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646) at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610) at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814) at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859) at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771) at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616) at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624) at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902) at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902) at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997) at org.apache.spark.repl.Main$.main(Main.scala:31) at org.apache.spark.repl.Main.main(Main.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at 
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
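Editor's note: the failure occurs because the shell echoes each result, which calls Hadoop's Job.toString(), and that method throws unless the job is in RUNNING state. The :silent workaround mentioned in the comment above simply stops the REPL from printing results, so toString() is never invoked and the val is assigned. Below is an illustrative spark-shell session, not captured output.
{code}
// spark-shell sketch (illustrative, not captured output).
// :silent toggles the REPL's automatic printing of results, so the shell never
// calls toString() on the new Job and the assignment succeeds.
scala> :silent
scala> import org.apache.hadoop.mapreduce.Job
scala> val job = new Job(sc.hadoopConfiguration)
scala> :silent
{code}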
[jira] [Resolved] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4304. --- Resolution: Fixed Fix Version/s: 1.0.3 1.1.1 1.2.0 Issue resolved by pull request 3163 [https://github.com/apache/spark/pull/3163] sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.2.0, 1.1.1, 1.0.3 {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4304) sortByKey() will fail on empty RDD
[ https://issues.apache.org/jira/browse/SPARK-4304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4304: -- Assignee: Davies Liu sortByKey() will fail on empty RDD -- Key: SPARK-4304 URL: https://issues.apache.org/jira/browse/SPARK-4304 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.2, 1.1.0, 1.2.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.1.1, 1.2.0, 1.0.3 {code} sc.parallelize(zip(range(4), range(0)), 5).sortByKey().count() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/davies/work/spark/python/pyspark/rdd.py", line 532, in sortByKey for i in range(0, numPartitions - 1)] IndexError: list index out of range {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4221) Allow access to nonnegative ALS from python
[ https://issues.apache.org/jira/browse/SPARK-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4221. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 3095 [https://github.com/apache/spark/pull/3095] Allow access to nonnegative ALS from python --- Key: SPARK-4221 URL: https://issues.apache.org/jira/browse/SPARK-4221 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino Assignee: Michelangelo D'Agostino Fix For: 1.2.0 SPARK-1553 added alternating nonnegative least squares to MLlib; however, it's not possible to access it via the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4221) Allow access to nonnegative ALS from python
[ https://issues.apache.org/jira/browse/SPARK-4221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4221: - Assignee: Michelangelo D'Agostino Allow access to nonnegative ALS from python --- Key: SPARK-4221 URL: https://issues.apache.org/jira/browse/SPARK-4221 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Michelangelo D'Agostino Assignee: Michelangelo D'Agostino Fix For: 1.2.0 SPARK-1553 added alternating nonnegative least squares to MLlib; however, it's not possible to access it via the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
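Editor's note: the Python hook added here wraps MLlib's existing Scala implementation from SPARK-1553, which exposes the constraint through a builder-style setter. The snippet below is a hedged sketch of that Scala-side usage for reference; the object name and parameter values are illustrative, and the exact Python keyword introduced by the pull request is not reproduced here.
{code}
// Hedged sketch of the Scala-side nonnegative ALS usage (SPARK-1553);
// names and parameter values are illustrative.
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object NonnegativeAlsExample {
  def train(ratings: RDD[Rating]) =
    new ALS()
      .setRank(10)
      .setIterations(10)
      .setNonnegative(true)   // constrain user and product factors to be nonnegative
      .run(ratings)
}
{code}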
[jira] [Updated] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-3821: Attachment: packer-proposal.html Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203280#comment-14203280 ] Nicholas Chammas commented on SPARK-3821: - After much dilly-dallying, I am happy to present: * A brief proposal / design doc ([fixed JIRA attachment | https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html], [md file on GitHub | https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md]) * [Initial implementation | https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md] * New AMIs generated by this implementation: [Base AMIs | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 Pre-Installed | https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0] To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47] [two | https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593] lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on the {{packer}} branch | https://github.com/nchammas/spark-ec2/tree/packer/packer]. Your candid feedback and/or improvements are most welcome! Develop an automated way of creating Spark images (AMI, Docker, and others) --- Key: SPARK-3821 URL: https://issues.apache.org/jira/browse/SPARK-3821 Project: Spark Issue Type: Improvement Components: Build, EC2 Reporter: Nicholas Chammas Assignee: Nicholas Chammas Attachments: packer-proposal.html Right now the creation of Spark AMIs or Docker containers is done manually. With tools like [Packer|http://www.packer.io/], we should be able to automate this work, and do so in such a way that multiple types of machine images can be created from a single template. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4291) Drop Code from network module names
[ https://issues.apache.org/jira/browse/SPARK-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4291. Resolution: Fixed Fix Version/s: 1.2.0 Drop Code from network module names - Key: SPARK-4291 URL: https://issues.apache.org/jira/browse/SPARK-4291 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.2.0 In Maven, the network modules have the suffix "Code", which is inconsistent with the other modules. {code} [INFO] Reactor Build Order: [INFO] [INFO] Spark Project Parent POM [INFO] Spark Project Common Network Code [INFO] Spark Project Shuffle Streaming Service Code [INFO] Spark Project Core [INFO] Spark Project Bagel [INFO] Spark Project GraphX [INFO] Spark Project Streaming [INFO] Spark Project Catalyst [INFO] Spark Project SQL [INFO] Spark Project ML Library [INFO] Spark Project Tools [INFO] Spark Project Hive [INFO] Spark Project REPL [INFO] Spark Project YARN Parent POM [INFO] Spark Project YARN Stable API [INFO] Spark Project Assembly [INFO] Spark Project External Twitter [INFO] Spark Project External Kafka [INFO] Spark Project External Flume Sink [INFO] Spark Project External Flume [INFO] Spark Project External ZeroMQ [INFO] Spark Project External MQTT [INFO] Spark Project Examples [INFO] Spark Project Yarn Shuffle Service Code {code} My proposal is to drop the suffix, especially before these module names make it into an official release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3648) Provide a script for fetching remote PR's for review
[ https://issues.apache.org/jira/browse/SPARK-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14203287#comment-14203287 ] Apache Spark commented on SPARK-3648: - User 'pwendell' has created a pull request for this issue: https://github.com/apache/spark/pull/3165 Provide a script for fetching remote PR's for review Key: SPARK-3648 URL: https://issues.apache.org/jira/browse/SPARK-3648 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell I've found it's useful to have a small utility script for fetching specific pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org