[jira] [Commented] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375873#comment-14375873
 ] 

Apache Spark commented on SPARK-6468:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5136

 Fix the race condition of subDirs in DiskBlockManager
 -

 Key: SPARK-6468
 URL: https://issues.apache.org/jira/browse/SPARK-6468
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.3.0
Reporter: Shixiong Zhu
Priority: Minor

 There are two race conditions on subDirs in DiskBlockManager:
 1. `getAllFiles` does not take the correct locks when reading the contents of 
 `subDirs`. Although it is designed for testing, it is still worthwhile to add the 
 correct locks and eliminate the race condition.
 2. The double-check in `getFile(filename: String)` has a race condition. If a 
 thread finds `subDirs(dirId)(subDirId)` to be non-null outside the `synchronized` 
 block, it may not see the correct contents of the File instance pointed to by 
 `subDirs(dirId)(subDirId)`, according to the Java memory model 
 (there is no volatile variable here).






[jira] [Created] (SPARK-6468) Fix the race condition of subDirs in DiskBlockManager

2015-03-23 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-6468:
---

 Summary: Fix the race condition of subDirs in DiskBlockManager
 Key: SPARK-6468
 URL: https://issues.apache.org/jira/browse/SPARK-6468
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.3.0
Reporter: Shixiong Zhu
Priority: Minor


There are two race conditions on subDirs in DiskBlockManager:

1. `getAllFiles` does not take the correct locks when reading the contents of 
`subDirs`. Although it is designed for testing, it is still worthwhile to add the 
correct locks and eliminate the race condition.

2. The double-check in `getFile(filename: String)` has a race condition. If a 
thread finds `subDirs(dirId)(subDirId)` to be non-null outside the `synchronized` 
block, it may not see the correct contents of the File instance pointed to by 
`subDirs(dirId)(subDirId)`, according to the Java memory model (there is no 
volatile variable here).
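
As an illustration of the second point, here is a minimal Scala sketch of the double-checked pattern and one way to make both reads and writes safe. The class and field names below are simplified stand-ins, not the actual DiskBlockManager code:

{code}
import java.io.File

// Minimal sketch with simplified names; not the actual DiskBlockManager code.
class SubDirCache(localDirs: Array[File], subDirsPerLocalDir: Int) {

  // Each inner array doubles as the lock guarding its own slots.
  private val subDirs =
    Array.fill(localDirs.length)(new Array[File](subDirsPerLocalDir))

  def getSubDir(dirId: Int, subDirId: Int): File = {
    // Read under the lock: this guarantees we see a fully constructed File,
    // which a bare double-checked read (without volatile) does not.
    val existing = subDirs(dirId).synchronized { subDirs(dirId)(subDirId) }
    if (existing != null) {
      existing
    } else {
      subDirs(dirId).synchronized {
        var dir = subDirs(dirId)(subDirId)
        if (dir == null) {
          dir = new File(localDirs(dirId), "%02x".format(subDirId))
          dir.mkdirs()
          subDirs(dirId)(subDirId) = dir
        }
        dir
      }
    }
  }

  // A getAllFiles-style scan should take the same per-dir locks before reading.
  def allSubDirs: Seq[File] =
    subDirs.toSeq.flatMap(dirs => dirs.synchronized { dirs.clone().toSeq }).filter(_ != null)
}
{code}

Reading under the same per-directory lock that guards writes (or marking the slots volatile) gives the visibility guarantee that the bare null check outside the synchronized block lacks.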






[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2015-03-23 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375766#comment-14375766
 ] 

Theodore Vasiloudis commented on SPARK-2394:


Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your 
cluster:

https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

There should be an easy way to execute this script when the cluster is being 
launched; I tried using the --user-data flag, but that doesn't seem to do it. 
Otherwise you'd have to rsync the script to each machine (easy: use 
~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh 
into each machine and run it (not so easy).

For Step 4, make sure that core-site.xml is changed in both the Hadoop 
config and the spark-conf/ directory. Also, as suggested in the 
hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out 
the line:

{code}
JAVA_LIBRARY_PATH=''
{code}

{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/
export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar
{code}

And this is what I added to both core-site.xml files:

{code:xml}
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
{code}


As for the code (Step 5) itself, I've tried the different variations suggested 
in the mailing list and other places and ended up using the following:

https://gist.github.com/thvasilo/cd99709eacb44c8a8cff

Note that this uses the sequenceFile reader, specifically for the Google 
Ngrams. Setting minPartitions is important in order to get good 
parallelism for whatever you do with the data later on (3 * the number of 
cores in your cluster seems like a good value).

You can run the above job using:

{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}

You should of course set the environment variables for your Spark master and the 
location of your fat jar.
Note that I'm passing the hadoop-lzo jar as local:, which assumes that every node 
has built the jar; the script given above takes care of that.

Do the above and you should get the count and the first line of the data when 
running the job.

 Make it easier to read LZO-compressed files from EC2 clusters
 -

 Key: SPARK-2394
 URL: https://issues.apache.org/jira/browse/SPARK-2394
 Project: Spark
  Issue Type: Improvement
  Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: compression

 Amazon hosts [a large Google n-grams data set on 
 S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
 perfect, among other things, for putting together interesting and easily 
 reproducible public demos of Spark's capabilities.
 The problem is that the data set is compressed using LZO, and it is currently 
 more painful than it should be to get your average {{spark-ec2}} cluster to 
 read input compressed in this way.
 This is what one has to go through to get a Spark cluster created with 
 {{spark-ec2}} to read LZO-compressed files:
 # Install the latest LZO release, perhaps via {{yum}}.
 # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
 it. To build {{hadoop-lzo}} you need Maven. 
 # Install Maven. For some reason, [you cannot install Maven with 
 {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
  so install it manually.
 # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
 configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
 # Make [the appropriate 
 calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
  to {{sc.newAPIHadoopFile}}.
 This seems like a bit too much work for what we're trying to accomplish.
 If we expect this to be a common pattern -- reading LZO-compressed files from 
 a {{spark-ec2}} cluster -- it would be great if we could 

[jira] [Comment Edited] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599
 ] 

Littlestar edited comment on SPARK-1702 at 3/23/15 11:00 AM:
-

I met this on Spark 1.3.0 + Mesos 0.21.1 with run-example SparkPi:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.spark.executor.MesosExecutorBackend


was (Author: cnstar9988):
I met this on Spark 1.3.0 + Mesos 0.21.1 with run-example SparkPi

 Mesos executor won't start because of a ClassNotFoundException
 --

 Key: SPARK-1702
 URL: https://issues.apache.org/jira/browse/SPARK-1702
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
  Labels: executors, mesos, spark

 Some discussion here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html
 Fix here (which is probably not the right fix): 
 https://github.com/apache/spark/pull/620
 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again.
 Error in Mesos executor stderr:
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0
 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 
 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j 
 profile: org/apache/spark/log4j-defaults.properties
 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant
 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication 
 enabled: false are ui acls enabled: false users with view permissions: 
 Set(vagrant)
 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started
 14/05/02 17:31:43 INFO Remoting: Starting remoting
 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@localhost:50843]
 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@localhost:50843]
 java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176)
 at org.apache.spark.executor.Executor.init(Executor.scala:106)
 at 
 org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56)
 Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] 
 Deactivating the executor libprocess
 The problem is that it can't find the class. 






[jira] [Comment Edited] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2015-03-23 Thread Theodore Vasiloudis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375766#comment-14375766
 ] 

Theodore Vasiloudis edited comment on SPARK-2394 at 3/23/15 11:38 AM:
--

Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your 
cluster:

https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

There should be an easy way to execute this script when the cluster is being 
launched; I tried using the --user-data flag, but that doesn't seem to do it. 
Otherwise you'd have to rsync the script to each machine (easy: use 
~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh 
into each machine and run it (not so easy).

For Step 4, make sure that core-site.xml is changed in both the Hadoop 
config and the spark-conf/ directory. Also, as suggested in the 
hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out 
the line:

{code}
JAVA_LIBRARY_PATH=''
{code}

{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/
export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar
{code}

And this is what I added to both core-site.xml files:

{code:xml}
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>

  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
{code}

Here is an easy way to test if everything works (replace ephemeral with 
persistent if you are using that):

{code}
echo "hello world" > test.log
lzop test.log
ephemeral-hdfs/bin/hadoop fs -copyFromLocal test.log.lzo /user/root/test.log.lzo
# Test local
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/root/test.log.lzo
# Test distributed
ephemeral-hdfs/bin/hadoop jar /root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/root/test.log.lzo
{code}


As for the code (Step 5) itself, I've tried the different variations suggested 
in the mailing list and other places and ended up using the following:

https://gist.github.com/thvasilo/cd99709eacb44c8a8cff

Note that this uses the sequenceFile reader, specifically for the Google 
Ngrams. Setting minPartitions is important in order to get good 
parallelism for whatever you do with the data later on (3 * the number of 
cores in your cluster seems like a good value).

You can run the above job using:

{code}
./bin/spark-submit --jars local:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  --class your.package.here.TestNgrams --master $SPARK_MASTER $SPARK_JAR dummy_arg
{code}

You should of course set the environment variables for your Spark master and the 
location of your fat jar.
Note that I'm passing the hadoop-lzo jar as local:, which assumes that every node 
has built the jar; the script given above takes care of that.

Do the above and you should get the count and the first line of the data when 
running the job.
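
For completeness, here is a rough Scala sketch of what such a Step 5 job can look like. The object name, S3 path, and partition count below are placeholders/assumptions, not the contents of the linked gist:

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.{SparkConf, SparkContext}

// Rough sketch only; names, path, and numbers are placeholders.
object TestNgrams {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestNgrams"))

    // The n-grams are LZO-compressed SequenceFiles; with the LZO codecs registered
    // in core-site.xml the sequenceFile reader decompresses them transparently.
    val ngrams = sc.sequenceFile(
      "s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-us-all/1gram/data",
      classOf[LongWritable], classOf[Text],
      minPartitions = 48) // roughly 3 * total cores in the cluster

    // Copy out of the reused Writable before doing anything else with the records.
    val lines = ngrams.map(_._2.toString)
    println(s"count: ${lines.count()}")
    println(s"first: ${lines.first()}")

    sc.stop()
  }
}
{code}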


was (Author: tvas):
Just adding some more info here for people who end up here through searches:

Steps 1-3 can be completed by running this script on each machine in your 
cluster:

https://gist.github.com/thvasilo/7696d21cb3205f5cb11d

There should be an easy way to execute this script when the cluster is being 
launched; I tried using the --user-data flag, but that doesn't seem to do it. 
Otherwise you'd have to rsync the script to each machine (easy: use 
~/spark-ec2/copy-dir after you've copied the file to your master) and then ssh 
into each machine and run it (not so easy).

For Step 4, make sure that core-site.xml is changed in both the Hadoop 
config and the spark-conf/ directory. Also, as suggested in the 
hadoop-lzo docs:

{quote}
Note that there seems to be a bug in /path/to/hadoop/bin/hadoop; comment out 
the line:

{code}
JAVA_LIBRARY_PATH=''
{code}

{quote}

Here's how I set the vars in spark-env.sh:

{code}
export SPARK_SUBMIT_LIBRARY_PATH=$SPARK_SUBMIT_LIBRARY_PATH:/root/persistent-hdfs/lib/native/:/root/hadoop-native:/root/hadoop-lzo/target/native/Linux-amd64-64/lib:/usr/lib64/
export SPARK_SUBMIT_CLASSPATH=$SPARK_CLASSPATH:$SPARK_SUBMIT_CLASSPATH:/root/hadoop-lzo/target/hadoop-lzo-0.4.20-SNAPSHOT.jar
{code}

And this is what I added to both core-site.xml files:

{code:xml}
  <property>
    <name>io.compression.codecs</name>


[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-23 Thread vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375707#comment-14375707
 ] 

vijay commented on SPARK-6435:
--

I tested this on Linux with the 1.3.0 release and it works fine, so this is 
apparently a Windows-specific issue: on Windows only the first jar is picked up.  
This appears to be a problem with parsing the command line, introduced by the 
change in the Windows scripts between 1.2.0 and 1.3.0.  A simple fix to 
bin\windows-utils.cmd resolves the issue.

I ran this command to test with 'real' jars:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar
{code}

Here are some snippets from the console. Note that only the first jar is added; 
I can load classes from the first jar but not the second:
{code}
15/03/23 10:57:41 INFO SparkUI: Started SparkUI at http://vgarla-t440P.fritz.box:4040
15/03/23 10:57:41 INFO SparkContext: Added JAR file:/c:/code/elasticsearch-1.4.2/lib/lucene-core-4.10.2.jar at http://192.168.178.41:54601/jars/lucene-core-4.10.2.jar with timestamp 1427104661969
15/03/23 10:57:42 INFO Executor: Starting executor ID driver on host localhost
...
scala> import org.apache.lucene.util.IOUtils
import org.apache.lucene.util.IOUtils

scala> import com.google.common.base.Strings
<console>:20: error: object Strings is not a member of package com.google.common.base
{code}

Looking at the command line in jvisualvm, I see that only the first jar is added:
{code}
Main class: org.apache.spark.deploy.SparkSubmit
Arguments: --class org.apache.spark.repl.Main --master local --jars 
c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar spark-shell 
c:\temp\guava-14.0.1.jar
{code}
In Spark 1.2.0, spark-shell2.cmd just passed arguments as-is to the java 
command line:
{code}
cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %* spark-shell
{code}

In Spark 1.3.0, spark-shell2.cmd calls windows-utils.cmd to parse arguments 
into SUBMISSION_OPTS and APPLICATION_OPTS.  Only the first jar in the list 
passed to --jars makes it into SUBMISSION_OPTS; the remaining jars are added to 
APPLICATION_OPTS:
{code}
call %SPARK_HOME%\bin\windows-utils.cmd %*
if %ERRORLEVEL% equ 1 (
  call :usage
  exit /b 1
)
echo SUBMISSION_OPTS=%SUBMISSION_OPTS%
echo APPLICATION_OPTS=%APPLICATION_OPTS%

cmd /V /E /C %SPARK_HOME%\bin\spark-submit.cmd --class org.apache.spark.repl.Main %SUBMISSION_OPTS% spark-shell %APPLICATION_OPTS%
{code}

The problem is that by the time the command line arguments get to 
windows-utils.cmd, the Windows command line processor has already split the 
comma-separated list into distinct arguments.  The Windows way of saying "treat 
this as a single argument" is to surround it in double quotes.  However, when I 
surround the jars in quotes, I get an error:
{code}
%SPARK_HOME%\bin\spark-shell --master local --jars "c:\code\elasticsearch-1.4.2\lib\lucene-core-4.10.2.jar,c:\temp\guava-14.0.1.jar"
c:\temp\guava-14.0.1.jar==x was unexpected at this time.
{code}
Digging in, I see this is caused by this line from windows-utils.cmd:
{code}
  if "x%2"=="x" (
{code}

Replacing the quotes with square brackets does the trick:
{code}
  if [x%2]==[x] (
{code}

Now the command line is processed correctly.



 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package 
 com.google.common.base
import com.google.common.base.Strings
   ^
 {code}






[jira] [Commented] (SPARK-2167) spark-submit should return exit code based on failure/success

2015-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376001#comment-14376001
 ] 

Sean Owen commented on SPARK-2167:
--

(Thanks [~tgraves] for having a look at some of these older issues. You'd know 
a lot about what's still in play or likely obsolete.)

 spark-submit should return exit code based on failure/success
 -

 Key: SPARK-2167
 URL: https://issues.apache.org/jira/browse/SPARK-2167
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Guoqiang Li

 The spark-submit script and Java class should exit with 0 on success and 
 non-zero on failure so that other command line tools and workflow managers 
 (like Oozie) can properly tell whether the Spark app succeeded or failed.






[jira] [Closed] (SPARK-6436) io/netty missing from external shuffle service jars for yarn

2015-03-23 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-6436.

Resolution: Invalid

This is working for me.  Sorry for the confusion, I had build environment 
issues.

 io/netty missing from external shuffle service jars for yarn
 

 Key: SPARK-6436
 URL: https://issues.apache.org/jira/browse/SPARK-6436
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, YARN
Affects Versions: 1.3.0
Reporter: Thomas Graves

 I was trying to use the external shuffle service on yarn but it appears that 
 io/netty isn't included in the network jars.  I loaded up network-common, 
 network-yarn, and network-shuffle.  If there is some other jar supposed to be 
 included please let me know.
 2015-03-20 14:25:07,142 [main] FATAL 
 org.apache.hadoop.yarn.server.nodemanager.NodeManager: Error starting 
 NodeManager
 java.lang.NoClassDefFoundError: io/netty/channel/EventLoopGroup
 at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockManager.init(ExternalShuffleBlockManager.java:64)
 at 
 org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.init(ExternalShuffleBlockHandler.java:53)
 at 
 org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:105)
 at 
 org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
 at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:143)






[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376009#comment-14376009
 ] 

Christophe PRÉAUD commented on SPARK-6469:
--

Sorry if I'm saying something stupid, but I would expect {{LOCAL_DIRS}} 
(according to Spark comments in 
[Utils.scala|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L749],
 {{YARN_LOCAL_DIRS}} is for Hadoop 0.23, and {{LOCAL_DIRS}} for Hadoop 2.X) to 
be set in yarn-client mode.

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Commented] (SPARK-4227) Document external shuffle service

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376022#comment-14376022
 ] 

Thomas Graves commented on SPARK-4227:
--

Looks like I had build issues. The instructions at 
http://spark.apache.org/docs/1.3.0/job-scheduling.html work.

 Document external shuffle service
 -

 Key: SPARK-4227
 URL: https://issues.apache.org/jira/browse/SPARK-4227
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical

 We should add spark.shuffle.service.enabled to the Configuration page and 
 give instructions for launching the shuffle service as an auxiliary service 
 on YARN.






[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376039#comment-14376039
 ] 

Christophe PRÉAUD commented on SPARK-6469:
--

Not exactly, sorry if this was not clear from my description:
* when I am running YARN on Hadoop 2 in *cluster* mode, both {{LOCAL_DIRS}} and 
{{CONTAINER_ID}} are correctly set.
* when I am running YARN on Hadoop 2 in *client* mode, neither {{LOCAL_DIRS}} 
nor {{CONTAINER_ID}} is correctly set.

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Commented] (SPARK-6255) Python MLlib API missing items: Classification

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375994#comment-14375994
 ] 

Apache Spark commented on SPARK-6255:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5137

 Python MLlib API missing items: Classification
 --

 Key: SPARK-6255
 URL: https://issues.apache.org/jira/browse/SPARK-6255
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 LogisticRegressionWithLBFGS
 * setNumClasses
 * setValidateData
 LogisticRegressionModel
 * getThreshold
 * numClasses
 * numFeatures
 SVMWithSGD
 * setValidateData
 SVMModel
 * getThreshold






[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376017#comment-14376017
 ] 

Sean Owen commented on SPARK-6469:
--

Ah, I get you; I think you have a point. So, you are running in YARN on Hadoop 2 
in cluster mode and neither {{YARN_LOCAL_DIRS}} nor {{CONTAINER_ID}} is set. 
Paging [~sandyr] [~tgraves] [~vanzin] for thoughts on whether that's to be 
expected or not, or whether a check here has to be adjusted.

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Commented] (SPARK-3735) Sending the factor directly or AtA based on the cost in ALS

2015-03-23 Thread Debasish Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376046#comment-14376046
 ] 

Debasish Das commented on SPARK-3735:
-

We might want to consider doing some of these things through an indexed RDD 
exposed through an API. Right now ALS is completely join based; could we do 
something nicer if we had access to an efficient read-only cache from ALS 
mapPartitions? The idea here is to think about zeros explicitly rather than 
adding the implicit heuristic, which is generally hard to tune.

 Sending the factor directly or AtA based on the cost in ALS
 ---

 Key: SPARK-3735
 URL: https://issues.apache.org/jira/browse/SPARK-3735
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 It is common to have some super popular products in the dataset. In this 
 case, sending many user factors to the target product block could be more 
 expensive than sending the normal equation `\sum_i u_i u_i^T` and `\sum_i u_i 
 r_ij` to the product block. The cost of sending a single factor is `k`, while 
 the cost of sending a normal equation is much more expensive, `k * (k + 3) / 
 2`. However, if we use normal equation for all products associated with a 
 user, we don't need to send this user factor.
 Determining the optimal assignment is hard. But we could use a simple 
 heuristic. Inside any rating block,
 1) order the product ids by the number of user ids associated with them in 
 desc order
 2) starting from the most popular product, mark popular products as use 
 normal eq and calculate the cost
 Remember the best assignment that comes with the lowest cost and use it for 
 computation.
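
As a rough illustration of the cost comparison, here is a simplified Scala sketch of the prefix heuristic. It treats each product independently and ignores the saving from users whose products all use the normal equation; the names and inputs are made up, not the proposed implementation:

{code}
// Simplified sketch; userCountsByProduct(i) = number of users rating product i
// inside one rating block, k = rank of the factorization.
def bestNormalEqPrefix(userCountsByProduct: Seq[Int], k: Int): Int = {
  val normalEqCost = k.toLong * (k + 3) / 2        // cost of sending \sum u u^T and \sum u r
  val sorted       = userCountsByProduct.sorted(Ordering[Int].reverse)
  val factorCosts  = sorted.map(_.toLong * k)      // cost of sending the raw user factors

  // cost(t): the t most popular products use the normal equation, the rest receive factors.
  val suffixFactorCost = factorCosts.scanRight(0L)(_ + _)
  val costs = (0 to sorted.length).map(t => t * normalEqCost + suffixFactorCost(t))
  costs.indexOf(costs.min)                         // best number of "normal eq" products
}
{code}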






[jira] [Updated] (SPARK-6463) AttributeSet.equal should compare size

2015-03-23 Thread June (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

June updated SPARK-6463:

Summary: AttributeSet.equal should compare size  (was: [SPARK][SQL] 
AttributeSet.equal should compare size)

 AttributeSet.equal should compare size
 --

 Key: SPARK-6463
 URL: https://issues.apache.org/jira/browse/SPARK-6463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: June
Priority: Minor

 AttributeSet.equal should compare both membership and size
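
As a generic REPL illustration (not the actual Catalyst code) of why a one-directional containment check alone is not a valid equality test:

{code}
// Illustrative only: a subset check alone calls these "equal"
// even though b has an extra element; adding a size comparison fixes it.
val a = Set("id", "name")
val b = Set("id", "name", "age")
val subsetOnly = a.forall(b.contains)                      // true, yet a != b
val withSize   = a.forall(b.contains) && a.size == b.size  // false, as expected
{code}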






[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580
 ] 

Littlestar edited comment on SPARK-6461 at 3/23/15 9:04 AM:


When I add MESOS_HADOOP_CONF_DIR to all mesos-master-env.sh and 
mesos-slave-env.sh files, it throws the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend

Similar to https://issues.apache.org/jira/browse/SPARK-1702


was (Author: cnstar9988):
When I add MESOS_HADOOP_CONF_DIR to all mesos-master-env.sh and 
mesos-slave-env.sh files, it throws the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend

Similar to https://github.com/apache/spark/pull/620

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I used Mesos to run Spark 1.3.0 ./run-example SparkPi, 
 but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found






[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375976#comment-14375976
 ] 

Sean Owen commented on SPARK-6469:
--

So, if {{YARN_LOCAL_DIRS}} is set, then {{isRunningInYarnContainer}} is 
{{true}} and it uses this for the local dir. {{CONTAINER_ID}} is not relevant 
to this. What local directory are you expecting it to use, if 
{{YARN_LOCAL_DIRS}} isn't set?

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375905#comment-14375905
 ] 

Thomas Graves commented on SPARK-6449:
--

[~rdub] Was there an exception in the log higher up? I'm wondering whether it 
shows the entire exception for the out-of-memory error.

 Driver OOM results in reported application result SUCCESS
 -

 Key: SPARK-6449
 URL: https://issues.apache.org/jira/browse/SPARK-6449
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Ryan Williams

 I ran a job yesterday that according to the History Server and YARN RM 
 finished with status {{SUCCESS}}.
 Clicking around on the history server UI, there were too few stages run, and 
 I couldn't figure out why that would have been.
 Finally, inspecting the end of the driver's logs, I saw:
 {code}
 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Shutting down remote daemon.
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remote daemon shut down; proceeding with flushing remote transports.
 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
 Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC 
 overhead limit exceeded (of class java.lang.OutOfMemoryError)
 at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
 exitCode: 0, (reason: Shutdown hook called before final status was reported.)
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
 ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
 final status was reported.)
 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
 Remoting shut down.
 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
 successfully unregistered.
 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
 .sparkStaging/application_1426705269584_0055
 {code}
 The driver OOM'd, [the {{catch}} block that presumably should have caught 
 it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
 written to the event log.
 This should be logged as a failed job and reported as such to YARN.
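
In other words (a simplified Scala sketch with made-up names, not the actual ApplicationMaster code): a handler that pattern-matches the caught value against an incomplete list of exception types will itself throw {{scala.MatchError}} on a {{java.lang.OutOfMemoryError}}, so the FAILED status is never reported. Ending the handler with catch-all cases avoids that:

{code}
import scala.util.control.NonFatal

// Simplified sketch; names are illustrative, not the actual ApplicationMaster code.
def runAndReport(userCode: => Unit): String =
  try {
    userCode
    "SUCCEEDED"
  } catch {
    case _: InterruptedException => "SUCCEEDED"                               // expected shutdown path
    case NonFatal(e)             => s"FAILED: ${e.getMessage}"                // ordinary failures
    case e: Throwable            => s"FAILED (fatal): ${e.getClass.getName}"  // OOM and other fatal errors
  }
{code}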






[jira] [Comment Edited] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375941#comment-14375941
 ] 

Christophe PRÉAUD edited comment on SPARK-6469 at 3/23/15 2:16 PM:
---

Attached a simple application to check the value of the {{CONTAINER_ID}} 
environment variable.

* Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web UI reads: {{CONTAINER_ID: container_142761810_0151_01_01}})

* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null


was (Author: preaudc):
Attached a simple application to check the value of the {{CONTAINER_ID}} 
environment variable.

* Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web UI reads: {{CONTAINER_ID: container_142761810_0151_01_01}})

* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Comment Edited] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375941#comment-14375941
 ] 

Christophe PRÉAUD edited comment on SPARK-6469 at 3/23/15 2:15 PM:
---

Attached a simple application to check the value of the {{CONTAINER_ID}} 
environment variable.

* Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web UI reads: {{CONTAINER_ID: container_142761810_0151_01_01}})

* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null


was (Author: preaudc):
Attached a simple application to check the value of the {{CONTAINER_ID}} 
environment variable.

* Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web UI reads: {{CONTAINER_ID: container_142761810_0151_01_01}})

* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe PRÉAUD updated SPARK-6469:
-
Attachment: TestYarnVars.scala

Attached a simple application to check the value of the {{CONTAINER_ID}} 
environment variable.

* Check in yarn-cluster mode:
{code}
/opt/spark/bin/spark-submit --master yarn-cluster --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
(the stdout of the application on the YARN web UI reads: {{CONTAINER_ID: container_142761810_0151_01_01}})

* Check in yarn-client mode:
{code}
/opt/spark/bin/spark-submit --master yarn-client --class TestYarnVars --queue spark-batch testyarnvars_2.10-1.0.jar 2>/dev/null
{code}
CONTAINER_ID: null
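
For reference, a minimal sketch of what such a check can look like (the actual TestYarnVars.scala attachment may differ):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Prints the YARN-related environment variables visible to the driver process.
// Minimal sketch; the attached TestYarnVars.scala may differ.
object TestYarnVars {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TestYarnVars"))
    Seq("CONTAINER_ID", "LOCAL_DIRS", "YARN_LOCAL_DIRS").foreach { name =>
      println(s"$name: ${sys.env.getOrElse(name, "null")}")
    }
    sc.stop()
  }
}
{code}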

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.






[jira] [Created] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA
Christophe PRÉAUD created SPARK-6469:


 Summary: Local directories configured for YARN are not used in 
yarn-client mode
 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor


According to the [Spark YARN doc 
page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], 
Spark executors will use the local directories configured for YARN, not 
spark.local.dir, which should be ignored.

While this works correctly in yarn-cluster mode, I've found that it is not the 
case in yarn-client mode.
The problem seems to originate in the method 
[isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].

Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
environment variable is correctly set in yarn-cluster mode (to something like 
{{container_142761810_0151_01_01}}), but not in yarn-client mode.






[jira] [Commented] (SPARK-2394) Make it easier to read LZO-compressed files from EC2 clusters

2015-03-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375929#comment-14375929
 ] 

Nicholas Chammas commented on SPARK-2394:
-

Thank you for posting this information for others!

 Make it easier to read LZO-compressed files from EC2 clusters
 -

 Key: SPARK-2394
 URL: https://issues.apache.org/jira/browse/SPARK-2394
 Project: Spark
  Issue Type: Improvement
  Components: EC2, Input/Output
Affects Versions: 1.0.0
Reporter: Nicholas Chammas
Priority: Minor
  Labels: compression

 Amazon hosts [a large Google n-grams data set on 
 S3|https://aws.amazon.com/datasets/8172056142375670]. This data set is 
 perfect, among other things, for putting together interesting and easily 
 reproducible public demos of Spark's capabilities.
 The problem is that the data set is compressed using LZO, and it is currently 
 more painful than it should be to get your average {{spark-ec2}} cluster to 
 read input compressed in this way.
 This is what one has to go through to get a Spark cluster created with 
 {{spark-ec2}} to read LZO-compressed files:
 # Install the latest LZO release, perhaps via {{yum}}.
 # Download [{{hadoop-lzo}}|https://github.com/twitter/hadoop-lzo] and build 
 it. To build {{hadoop-lzo}} you need Maven. 
 # Install Maven. For some reason, [you cannot install Maven with 
 {{yum}}|http://stackoverflow.com/questions/7532928/how-do-i-install-maven-with-yum],
  so install it manually.
 # Update your {{core-site.xml}} and {{spark-env.sh}} with [the appropriate 
 configs|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3cca+-p3aga6f86qcsowp7k_r+8r-dgbmj3gz+4xljzjpr90db...@mail.gmail.com%3E].
 # Make [the appropriate 
 calls|http://mail-archives.apache.org/mod_mbox/spark-user/201312.mbox/%3CCA+-p3AGSPeNE5miQRFHC7-ZwNbicaXfh1-ZXdKJ=saw_mgr...@mail.gmail.com%3E]
  to {{sc.newAPIHadoopFile}}.
 This seems like a bit too much work for what we're trying to accomplish.
 If we expect this to be a common pattern -- reading LZO-compressed files from 
 a {{spark-ec2}} cluster -- it would be great if we could somehow make this 
 less painful.






[jira] [Commented] (SPARK-6255) Python MLlib API missing items: Classification

2015-03-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376095#comment-14376095
 ] 

Yanbo Liang commented on SPARK-6255:


[~josephkb] Can you assign it to me?

 Python MLlib API missing items: Classification
 --

 Key: SPARK-6255
 URL: https://issues.apache.org/jira/browse/SPARK-6255
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 LogisticRegressionWithLBFGS
 * setNumClasses
 * setValidateData
 LogisticRegressionModel
 * getThreshold
 * numClasses
 * numFeatures
 SVMWithSGD
 * setValidateData
 SVMModel
 * getThreshold






[jira] [Commented] (SPARK-6451) Support CombineSum in Code Gen

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376088#comment-14376088
 ] 

Apache Spark commented on SPARK-6451:
-

User 'gvramana' has created a pull request for this issue:
https://github.com/apache/spark/pull/5138

 Support CombineSum in Code Gen
 --

 Key: SPARK-6451
 URL: https://issues.apache.org/jira/browse/SPARK-6451
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 Since we are using CombineSum at the reducer side for the SUM function, we 
 need to make it work in code gen. Otherwise, code gen will not convert 
 Aggregates with a SUM function to GeneratedAggregates (the code gen version).






[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376065#comment-14376065
 ] 

Thomas Graves commented on SPARK-6469:
--

Note that if it's purely the documentation that confused you, then we should 
update the documentation to clarify the client/cluster mode differences.

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.
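
 A minimal sketch of that kind of check (illustrative only; this is not the 
 attached TestYarnVars.scala):
 {code}
 object ContainerIdCheck {
   def main(args: Array[String]): Unit = {
     // Utils.isRunningInYarnContainer keys off this variable; the NodeManager
     // sets it inside YARN containers, so a yarn-client driver running on the
     // gateway host will not see it.
     println(sys.env.getOrElse("CONTAINER_ID", "CONTAINER_ID is not set"))
   }
 }
 {code}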



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376061#comment-14376061
 ] 

Thomas Graves commented on SPARK-6469:
--

Are you saying they are not set on the driver node in yarn-client mode? If so, 
that is what I would expect, since the driver is not running on the YARN 
cluster; it's running on the gateway (wherever you launch it). Is the driver 
now choosing local directories for the executors to use? If not, what problem 
is this causing exactly?

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376084#comment-14376084
 ] 

Christophe PRÉAUD commented on SPARK-6469:
--

The problem I have is that Spark temporary files are written to {{/tmp}} in 
yarn-client mode, but your explanation makes sense: the gateway is indeed not 
on the YARN cluster, so this is expected.
I agree though that an update to the documentation to clarify this would be 
welcome :-)

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376092#comment-14376092
 ] 

Thomas Graves commented on SPARK-6469:
--

Yeah, so you should be able to set the spark.local.dir config to change that 
directory for the driver in yarn-client mode. Executors will still use the 
YARN-approved directories. 

We should change this JIRA to clarify the documentation then. 
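
For example, a minimal sketch of that workaround (the path below is a 
placeholder):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// In yarn-client mode the driver runs on the gateway host, so spark.local.dir
// controls where *it* writes scratch files; executors still use the
// YARN-provided local directories.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/data/tmp/spark")  // placeholder path
val sc = new SparkContext(conf)
{code}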

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe PRÉAUD updated SPARK-6469:
-
Issue Type: Documentation  (was: Bug)

 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 spark.local.dir which should be ignored.
 If this works correctly in yarn-cluster mode, I've found out that it is not 
 the case in yarn-client mode.
 The problem seems to originate in the method 
 [isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].
 Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
 environment variable is correctly set in yarn-cluster mode (to something like 
 {{container_142761810_0151_01_01}}, but not in yarn-client mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6460) Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides

2015-03-23 Thread liyunzhang_intel (JIRA)
liyunzhang_intel created SPARK-6460:
---

 Summary: Implement OpensslAesCtrCryptoCodec to enable encrypted 
shuffle algorithms which openssl provides
 Key: SPARK-6460
 URL: https://issues.apache.org/jira/browse/SPARK-6460
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Reporter: liyunzhang_intel


SPARK-5682 only implements the encrypted shuffle algorithm provided by JCE. 
OpensslAesCtrCryptoCodec needs to implement the algorithm provided by OpenSSL.
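
For context, the JCE path boils down to the standard AES/CTR transformation; a 
minimal, illustrative sketch (key/IV handling here is not how the 
encrypted-shuffle work manages them):
{code}
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// Illustrative only: the JCE-provided AES-CTR transformation that an
// OpenSSL-backed codec would supply through OpenSSL instead.
def encrypt(key: Array[Byte], iv: Array[Byte], data: Array[Byte]): Array[Byte] = {
  val cipher = Cipher.getInstance("AES/CTR/NoPadding")
  cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv))
  cipher.doFinal(data)
}
{code}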



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6460) Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms which openssl provides

2015-03-23 Thread liyunzhang_intel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liyunzhang_intel updated SPARK-6460:

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-5682

 Implement OpensslAesCtrCryptoCodec to enable encrypted shuffle algorithms 
 which openssl provides
 

 Key: SPARK-6460
 URL: https://issues.apache.org/jira/browse/SPARK-6460
 Project: Spark
  Issue Type: Sub-task
  Components: Shuffle
Reporter: liyunzhang_intel

 SPARK-5682 only implements the encrypted shuffle algorithm provided by JCE. 
 OpensslAesCtrCryptoCodec needs to implement the algorithm provided by OpenSSL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6483) Spark SQL udf(ScalaUdf) is very slow

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377338#comment-14377338
 ] 

Apache Spark commented on SPARK-6483:
-

User 'zzcclp' has created a pull request for this issue:
https://github.com/apache/spark/pull/5154

 Spark SQL udf(ScalaUdf) is very slow
 

 Key: SPARK-6483
 URL: https://issues.apache.org/jira/browse/SPARK-6483
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0, 1.4.0
 Environment: 1. Spark version is 1.3.0 
 2. 3 nodes, each with 80 GB memory / 20 cores 
 3. read 250 GB of Parquet files from HDFS 
Reporter: zzc

 Test case: 
 1. 
 register floor func with command: 
 sqlContext.udf.register("floor", (ts: Int) => ts - ts % 300), 
 then run with sql "select chan, floor(ts) as tt, sum(size) from qlogbase3 
 group by chan, floor(ts)", 
 *it takes 17 minutes.*
 {quote}
 == Physical Plan ==   
   
 Aggregate false, [chan#23015,PartialGroup#23500], 
 [chan#23015,PartialGroup#23500 AS tt#23494,CombineSum(PartialSum#23499L) AS 
 c2#23495L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23500], 54) 
   Aggregate true, [chan#23015,scalaUDF(ts#23016)], 
 [chan#23015,*scalaUDF*(ts#23016) AS PartialGroup#23500,SUM(size#23023L) AS 
 PartialSum#23499L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[115] at 
 map at newParquet.scala:562 
 {quote}
 2. 
 run with sql "select chan, (ts - ts % 300) as tt, sum(size) from qlogbase3 
 group by chan, (ts - ts % 300)", 
 *it takes only 5 minutes.*
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23349], 
 [chan#23015,PartialGroup#23349 AS tt#23343,CombineSum(PartialSum#23348L) AS 
 c2#23344L]   
  Exchange (HashPartitioning [chan#23015,PartialGroup#23349], 54)   
   Aggregate true, [chan#23015,(ts#23016 - (ts#23016 % 300))], 
 [chan#23015,*(ts#23016 - (ts#23016 % 300))* AS 
 PartialGroup#23349,SUM(size#23023L) AS PartialSum#23348L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[83] at map 
 at newParquet.scala:562 
 {quote}
 3. 
 use *HiveContext* with sql "select chan, floor((ts - ts % 300)) as tt, 
 sum(size) from qlogbase3 group by chan, floor((ts - ts % 300))" 
 *it takes only 5 minutes too. *
 {quote}
 == Physical Plan == 
 Aggregate false, [chan#23015,PartialGroup#23108L], 
 [chan#23015,PartialGroup#23108L AS tt#23102L,CombineSum(PartialSum#23107L) AS 
 _c2#23103L] 
  Exchange (HashPartitioning [chan#23015,PartialGroup#23108L], 54) 
   Aggregate true, 
 [chan#23015,HiveGenericUdf#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300)))], 
 [chan#23015,*HiveGenericUdf*#org.apache.hadoop.hive.ql.udf.generic.GenericUDFFloor((ts#23016
  - (ts#23016 % 300))) AS PartialGroup#23108L,SUM(size#23023L) AS 
 PartialSum#23107L] 
PhysicalRDD [chan#23015,ts#23016,size#23023L], MapPartitionsRDD[28] at map 
 at newParquet.scala:562 
 {quote}
 *Why is ScalaUdf so slow? How can it be improved?*



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2015-03-23 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14377343#comment-14377343
 ] 

Reynold Xin commented on SPARK-3306:


Can you elaborate on why this needs to be in Spark and can't live outside? It 
seems to me this can be implemented entirely outside of Spark. In particular:

1. Use a global singleton object that manages resources.
2. The singleton can register a shutdown hook to clear resources upon JVM exit.

And it probably would take just a few lines of code to implement the above two.
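
For example, a rough sketch of that approach (JDBC is purely illustrative and 
nothing here is a Spark API):
{code}
import java.sql.{Connection, DriverManager}
import scala.collection.mutable

// Global singleton that manages resources; the pooling policy is deliberately naive.
object ResourceRegistry {
  private val connections = mutable.Map.empty[String, Connection]

  def get(url: String): Connection = synchronized {
    connections.getOrElseUpdate(url, DriverManager.getConnection(url))
  }

  // Shutdown hook to clear resources upon JVM exit.
  sys.addShutdownHook {
    synchronized {
      connections.values.foreach(c => try c.close() catch { case _: Exception => () })
      connections.clear()
    }
  }
}
{code}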

 Addition of external resource dependency in executors
 -

 Key: SPARK-3306
 URL: https://issues.apache.org/jira/browse/SPARK-3306
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Yan

 Currently, Spark executors only support static and read-only external 
 resources of side files and jar files. With emerging disparate data sources, 
 there is a need to support more versatile external resources, such as 
 connections to data sources, to facilitate efficient data access to those 
 sources. For one, the JDBCRDD, with some modifications, could benefit from 
 this feature by reusing JDBC connections established earlier in the same Spark 
 context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6477) Run MIMA tests before the Spark test suite

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376740#comment-14376740
 ] 

Apache Spark commented on SPARK-6477:
-

User 'brennonyork' has created a pull request for this issue:
https://github.com/apache/spark/pull/5145

 Run MIMA tests before the Spark test suite
 --

 Key: SPARK-6477
 URL: https://issues.apache.org/jira/browse/SPARK-6477
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Brennon York
Priority: Minor

 Right now the MIMA tests are the last thing to run, yet they run very quickly 
 and, if they fail, there was no need for the entire Spark test suite to have 
 completed first. I propose we move the MIMA tests to run before the full Spark 
 suite so that builds that fail the MIMA checks will return much faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6229) Support SASL encryption in network/common module

2015-03-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376147#comment-14376147
 ] 

Marcelo Vanzin commented on SPARK-6229:
---

[~adav] the problem of exposing the pipeline as an API is twofold:

* Every application needs to understand the internals of the pipeline. For 
example, in your EncryptionHandler suggestion, SSL and SASL benefit from 
being placed in different locations inside the pipeline. How do you expose that 
in an external API? And why make client code even care about that? Also, I 
don't necessarily agree that SSL is an application concern. It's a 
transport-level protocol - which is one of the reasons it would be placed in a 
separate place in the stack from a SASL handler, for example.

* Handling SASL and SSL inside the network library does not necessarily make it 
any less unit-testable, stable or fast. It just makes it easier for clients to 
use those things. Instead of writing a bunch of code that needs to be 
synchronized between client and server, all they need is a proper configuration 
object. Configuration (= data) is much easier to change and fix than code.

The SecurityManager issue is already solved in the transport library. When I 
moved the SASL code to network/common I moved {{SecretKeyHolder}} with it. So 
there you have, your application-agnostic interface for providing security 
secrets for the network library.

So, again, what I'm suggesting here is not to hardcode SSL and SASL into the 
library. I'm suggesting an easier interface for people to configure SSL and 
SASL that doesn't require writing any extra code. If they don't want either of 
those, they still have that option, but instead of deleting / disabling / 
conditioning code, they'd change a couple of lines in a config file. They'd get 
the same stable, fast network library without SSL nor SASL, without having to 
change a single line of code.

Another problem with your example ({{transport.setEncryptionHandler}} vs. {{if 
(sasl) ...}}) is that the latter would be needed anyway if you want SASL. Why 
not then also have an AuthenticationHandler alongside the encryption handler?

As for your factory comment, that's already there, in a way. The bootstrap 
functionality is basically a way to insert things into the channel being 
instantiated. What I'm proposing here is twofold: first, extend that interface 
so that the bootstrap implementation can modify the pipeline (and also allow 
server bootstraps, for reasons I explained in my first long comment), and 
second, control which bootstraps get activated via configuration, not via code.

Note that, internally within the library, you'd have basically what you're 
saying: SSL and SASL would be plugins of a sort that you can insert into the 
transport layer only if you want. The difference that I'm trying to convey here 
is that the *external* interface of the library doesn't expose those.
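
(Purely to illustrate the config-vs-code distinction; the names and config keys 
below are hypothetical, not the actual network/common types:)
{code}
// Hypothetical types and keys, only to show "enable via configuration" rather
// than "enable via client code".
trait ChannelBootstrap { def install(): Unit }
class SaslBootstrap extends ChannelBootstrap { def install(): Unit = () }
class SslBootstrap  extends ChannelBootstrap { def install(): Unit = () }

def bootstrapsFrom(conf: Map[String, String]): Seq[ChannelBootstrap] = {
  val wantSasl = conf.getOrElse("encryption.sasl.enabled", "false").toBoolean
  val wantSsl  = conf.getOrElse("encryption.ssl.enabled", "false").toBoolean
  (if (wantSasl) Seq(new SaslBootstrap) else Nil) ++
    (if (wantSsl) Seq(new SslBootstrap) else Nil)
}
{code}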

 Support SASL encryption in network/common module
 

 Key: SPARK-6229
 URL: https://issues.apache.org/jira/browse/SPARK-6229
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Marcelo Vanzin

 After SASL support has been added to network/common, supporting encryption 
 should be rather simple. Encryption is supported for DIGEST-MD5 and GSSAPI. 
 Since the latter requires a valid kerberos login to work (and so doesn't 
 really work with executors), encryption would require the use of DIGEST-MD5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4746) integration tests should be separated from faster unit tests

2015-03-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid closed SPARK-4746.
---
Resolution: Won't Fix

Looks like there isn't interest in this; closing to clean up JIRA.

 integration tests should be separated from faster unit tests
 

 Key: SPARK-4746
 URL: https://issues.apache.org/jira/browse/SPARK-4746
 Project: Spark
  Issue Type: Bug
  Components: Tests
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Trivial

 Currently there isn't a good way for a developer to skip the longer 
 integration tests.  This can slow down local development.  See 
 http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
 One option is to use scalatest's notion of test tags to tag all integration 
 tests, so they could easily be skipped
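
 A rough sketch of the tagging idea (the tag name is made up):
 {code}
 import org.scalatest.{FunSuite, Tag}

 // The tag name is illustrative.
 object IntegrationTest extends Tag("org.apache.spark.test.IntegrationTest")

 class ExampleSuite extends FunSuite {
   test("fast unit check") {
     assert(1 + 1 == 2)
   }

   test("slow end-to-end check", IntegrationTest) {
     // long-running setup and assertions would go here
   }
 }

 // Tagged tests could then be excluded with scalatest's -l option, e.g.
 //   sbt "test-only *ExampleSuite -- -l org.apache.spark.test.IntegrationTest"
 {code}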



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) Local directories configured for YARN are not used in yarn-client mode

2015-03-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe PRÉAUD updated SPARK-6469:
-
Description: 
According to the [Spark YARN doc 
page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], 
Spark executors will use the local directories configured for YARN, not 
{{spark.local.dir}} which should be ignored.

It should be noted though that in yarn-client mode, though the executors will 
indeed use the local directories configured for YARN, the driver will not, 
because it is not running on the YARN cluster; the driver in yarn-client will 
use the local directories defined in {{spark.local.dir}}

  was:
According to the [Spark YARN doc 
page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], 
Spark executors will use the local directories configured for YARN, not 
spark.local.dir which should be ignored.

If this works correctly in yarn-cluster mode, I've found out that it is not the 
case in yarn-client mode.
The problem seems to originate in the method 
[isRunningInYarnContainer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L686].

Indeed, I've checked with a simple application that the {{CONTAINER_ID}} 
environment variable is correctly set in yarn-cluster mode (to something like 
{{container_142761810_0151_01_01}}, but not in yarn-client mode.


 Local directories configured for YARN are not used in yarn-client mode
 --

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 It should be noted though that in yarn-client mode, though the executors will 
 indeed use the local directories configured for YARN, the driver will not, 
 because it is not running on the YARN cluster; the driver in yarn-client will 
 use the local directories defined in {{spark.local.dir}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe PRÉAUD updated SPARK-6469:
-
Summary: The YARN driver in yarn-client mode will not use the local 
directories configured for YARN   (was: Local directories configured for YARN 
are not used in yarn-client mode)

 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 It should be noted though that in yarn-client mode, though the executors will 
 indeed use the local directories configured for YARN, the driver will not, 
 because it is not running on the YARN cluster; the driver in yarn-client will 
 use the local directories defined in {{spark.local.dir}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christophe PRÉAUD updated SPARK-6469:
-
Description: 
According to the [Spark YARN doc 
page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], 
Spark executors will use the local directories configured for YARN, not 
{{spark.local.dir}} which should be ignored.

However it should be noted that in yarn-client mode, though the executors will 
indeed use the local directories configured for YARN, the driver will not, 
because it is not running on the YARN cluster; the driver in yarn-client will 
use the local directories defined in {{spark.local.dir}}

Can this please be clarified in the Spark YARN documentation above?

  was:
According to the [Spark YARN doc 
page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes], 
Spark executors will use the local directories configured for YARN, not 
{{spark.local.dir}} which should be ignored.

It should be noted though that in yarn-client mode, though the executors will 
indeed use the local directories configured for YARN, the driver will not, 
because it is not running on the YARN cluster; the driver in yarn-client will 
use the local directories defined in {{spark.local.dir}}


 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 However it should be noted that in yarn-client mode, though the executors 
 will indeed use the local directories configured for YARN, the driver will 
 not, because it is not running on the YARN cluster; the driver in yarn-client 
 will use the local directories defined in {{spark.local.dir}}
 Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376111#comment-14376111
 ] 

Christophe PRÉAUD commented on SPARK-6469:
--

I've changed the JIRA title, type and description, is this ok?
Thanks to all of you for your help!

 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: Spark Core
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 However it should be noted that in yarn-client mode, though the executors 
 will indeed use the local directories configured for YARN, the driver will 
 not, because it is not running on the YARN cluster; the driver in yarn-client 
 will use the local directories defined in {{spark.local.dir}}
 Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-6469:
-
Component/s: (was: Spark Core)
 YARN

 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 However it should be noted that in yarn-client mode, though the executors 
 will indeed use the local directories configured for YARN, the driver will 
 not, because it is not running on the YARN cluster; the driver in yarn-client 
 will use the local directories defined in {{spark.local.dir}}
 Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6469) The YARN driver in yarn-client mode will not use the local directories configured for YARN

2015-03-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376119#comment-14376119
 ] 

Thomas Graves commented on SPARK-6469:
--

looks good, thanks.

 The YARN driver in yarn-client mode will not use the local directories 
 configured for YARN 
 ---

 Key: SPARK-6469
 URL: https://issues.apache.org/jira/browse/SPARK-6469
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Reporter: Christophe PRÉAUD
Priority: Minor
 Attachments: TestYarnVars.scala


 According to the [Spark YARN doc 
 page|http://spark.apache.org/docs/latest/running-on-yarn.html#important-notes],
  Spark executors will use the local directories configured for YARN, not 
 {{spark.local.dir}} which should be ignored.
 However it should be noted that in yarn-client mode, though the executors 
 will indeed use the local directories configured for YARN, the driver will 
 not, because it is not running on the YARN cluster; the driver in yarn-client 
 will use the local directories defined in {{spark.local.dir}}
 Can this please be clarified in the Spark YARN documentation above?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6308) VectorUDT is displayed as `vecto` in dtypes

2015-03-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6308:
-
Assignee: Manoj Kumar

 VectorUDT is displayed as `vecto` in dtypes
 ---

 Key: SPARK-6308
 URL: https://issues.apache.org/jira/browse/SPARK-6308
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Reporter: Xiangrui Meng
Assignee: Manoj Kumar

 VectorUDT should override simpleString instead of relying on the default 
 implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6435:
-
Component/s: Windows

Great debugging! [~tsudukim] do you have thoughts on this? I think this bit was 
part of your change in 
https://github.com/apache/spark/commit/8d932475e6759e869c16ce6cac203a2e56558716#diff-7ac5881d6bad553b23f5225775c8fde3

So, it sounds like you do need to quote the comma-separated arg, but then 
quoting doesn't work as expected?

The {{x%2==x}} idiom is used in several places in the Windows scripts. Is the 
square bracket syntax definitely preferred?

 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package 
 com.google.common.base
import com.google.common.base.Strings
   ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6320) Adding new query plan strategy to SQLContext

2015-03-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376434#comment-14376434
 ] 

Michael Armbrust commented on SPARK-6320:
-

If that can be done in a minimally invasive way that sounds reasonable to me.

 Adding new query plan strategy to SQLContext
 

 Key: SPARK-6320
 URL: https://issues.apache.org/jira/browse/SPARK-6320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Youssef Hatem
Priority: Minor

 Hi,
 I would like to add a new strategy to {{SQLContext}}. To do this I created a 
 new class which extends {{Strategy}}. In my new class I need to call 
 {{planLater}} function. However this method is defined in {{SparkPlanner}} 
 (which itself inherits the method from {{QueryPlanner}}).
 To my knowledge, the only way to make the {{planLater}} function visible to my 
 new strategy is to define the strategy inside another class that extends 
 {{SparkPlanner}} and thus inherits {{planLater}}; by doing so, I will 
 have to extend {{SQLContext}} so that I can override the {{planner}} 
 field with the new {{Planner}} class I created.
 It seems that this is a design problem because adding a new strategy seems to 
 require extending {{SQLContext}} (unless I am doing it wrong and there is a 
 better way to do it).
 Thanks a lot,
 Youssef



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6200) Support dialect in SQL

2015-03-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376450#comment-14376450
 ] 

Michael Armbrust commented on SPARK-6200:
-

Making comments there would be a good idea.  Generally, I don't think extra 
complexity is worth it just to have short names.  Likely most users will either 
use the default value or will just be copying and pasting from some 
documentation into a config file once.  If this were an interface we expected 
them to toggle a lot, it would probably be different.

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager: it supports a dialect command, adding new 
 dialects via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6200) Support dialect in SQL

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6200.
-
Resolution: Duplicate

 Support dialect in SQL
 --

 Key: SPARK-6200
 URL: https://issues.apache.org/jira/browse/SPARK-6200
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: haiyang

 Created a new dialect manager: it supports a dialect command, adding new 
 dialects via SQL statements, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6255) Python MLlib API missing items: Classification

2015-03-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6255:
-
Assignee: Yanbo Liang

 Python MLlib API missing items: Classification
 --

 Key: SPARK-6255
 URL: https://issues.apache.org/jira/browse/SPARK-6255
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between the Python and Scala documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 LogisticRegressionWithLBFGS
 * setNumClasses
 * setValidateData
 LogisticRegressionModel
 * getThreshold
 * numClasses
 * numFeatures
 SVMWithSGD
 * setValidateData
 SVMModel
 * getThreshold



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5928) Remote Shuffle Blocks cannot be more than 2 GB

2015-03-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376430#comment-14376430
 ] 

Imran Rashid commented on SPARK-5928:
-

Sorry to hear that [~douglaz].  To help understand / prioritize this, can you 
share a bit more info?

a) how much data were you shuffling?
b) were you able to fix this by increasing the number of partitions?  how many 
partitions did you need to use in the end?
c) did you get a mix of snappy errors as well?
d) did you also run into SPARK-5945 as a result of your failures ?

thanks

 Remote Shuffle Blocks cannot be more than 2 GB
 --

 Key: SPARK-5928
 URL: https://issues.apache.org/jira/browse/SPARK-5928
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 If a shuffle block is over 2GB, the shuffle fails, with an uninformative 
 exception.  The tasks get retried a few times and then eventually the job 
 fails.
 Here is an example program which can cause the exception:
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}
 Note that you can't trigger this exception in local mode; it only happens on 
 remote fetches.   I triggered these exceptions running with 
 {{MASTER=yarn-client spark-shell --num-executors 2 --executor-memory 4000m}}
 {noformat}
 15/02/20 11:10:23 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 
 imran-3.ent.cloudera.com): FetchFailed(BlockManagerId(1, 
 imran-2.ent.cloudera.com, 55028), shuffleId=1, mapId=0, reduceId=0, message=
 org.apache.spark.shuffle.FetchFailedException: Adjusted frame length exceeds 
 2147483647: 3021252889 - discarded
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.org$apache$spark$shuffle$hash$BlockStoreShuffleFetcher$$unpackBlock$1(BlockStoreShuffleFetcher.scala:67)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at 
 org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
   at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
   at 
 org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:46)
   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: io.netty.handler.codec.TooLongFrameException: Adjusted frame 
 length exceeds 2147483647: 3021252889 - discarded
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.fail(LengthFieldBasedFrameDecoder.java:501)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.failIfNecessary(LengthFieldBasedFrameDecoder.java:477)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:403)
   at 
 io.netty.handler.codec.LengthFieldBasedFrameDecoder.decode(LengthFieldBasedFrameDecoder.java:343)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:249)
   at 
 io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:149)
   at 
 io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
   at 
 io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
   at 
 io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
   at 
 io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
   at 
 io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
   at 
 

[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376495#comment-14376495
 ] 

Patrick Wendell commented on SPARK-2331:


By the way - [~rxin] recently pointed out to me that EmptyRDD is private[spark].

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/EmptyRDD.scala#L27

Given that, I'm sort of confused how people were using it before. I'm not 
totally sure how making a class private[spark] affects its use in a return type.

 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel
Priority: Minor

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 {code}
 val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6472) Elements of an array of structs cannot be accessed.

2015-03-23 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6472:
---

 Summary: Elements of an array of structs cannot be accessed.
 Key: SPARK-6472
 URL: https://issues.apache.org/jira/browse/SPARK-6472
 Project: Spark
  Issue Type: Bug
Reporter: Yin Huai
Priority: Blocker


I tried the following snippet with HiveContext.
{code}
import sqlContext._
val rdd = sc.parallelize("""{"a":[{"b":1}, {"b":2}]}""" :: Nil)
val df = jsonRDD(rdd)
df.registerTempTable("jt")

// This one does not work.
df.select("a[0]").collect

// This one is fine.
sql("select a[0] from jt").collect
{code}

The exception is 
{code}
df.select("a[0]").collect

org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input 
columns a;
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
at 
org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108)
at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133)
at 
org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465)
at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6472) Elements of an array of structs cannot be accessed.

2015-03-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6472:

Component/s: SQL

 Elements of an array of structs cannot be accessed.
 ---

 Key: SPARK-6472
 URL: https://issues.apache.org/jira/browse/SPARK-6472
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 I tried the following snippet with HiveContext.
 {code}
 import sqlContext._
 val rdd = sc.parallelize("""{"a":[{"b":1}, {"b":2}]}""" :: Nil)
 val df = jsonRDD(rdd)
 df.registerTempTable("jt")
 // This one does not work.
 df.select("a[0]").collect
 // This one is fine.
 sql("select a[0] from jt").collect
 {code}
 The exception is 
 {code}
 df.select("a[0]").collect
 org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input 
 columns a;
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108)
   at org.apache.spark.sql.DataFrame.init(DataFrame.scala:133)
   at 
 org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465)
   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6472) Elements of an array of structs cannot be accessed.

2015-03-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-6472.
-
Resolution: Not a Problem

It is not a problem. For select, we support plain column name strings. I need 
to use selectExpr to access an array element.
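
For example:
{code}
// selectExpr parses its arguments as SQL expressions, so array indexing works:
df.selectExpr("a[0]").collect()

// ...whereas df.select("a[0]") treats the whole string as a column name and fails.
{code}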

 Elements of an array of structs cannot be accessed.
 ---

 Key: SPARK-6472
 URL: https://issues.apache.org/jira/browse/SPARK-6472
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 I tried the following snippet with HiveContext.
 {code}
 import sqlContext._
 val rdd = sc.parallelize("""{"a":[{"b":1}, {"b":2}]}""" :: Nil)
 val df = jsonRDD(rdd)
 df.registerTempTable("jt")
 // This one does not work.
 df.select("a[0]").collect
 // This one is fine.
 sql("select a[0] from jt").collect
 {code}
 The exception is 
 {code}
 df.select("a[0]").collect
 org.apache.spark.sql.AnalysisException: cannot resolve 'a[0]' given input 
 columns a;
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:104)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:118)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:117)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:122)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3.apply(CheckAnalysis.scala:43)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:88)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis.apply(CheckAnalysis.scala:43)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1108)
   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
   at 
 org.apache.spark.sql.DataFrame.logicalPlanToDataFrame(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:465)
   at org.apache.spark.sql.DataFrame.select(DataFrame.scala:480)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2331) SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]

2015-03-23 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2331:
---
Description: 
The return type for SparkContext.emptyRDD is EmptyRDD[T].

It should be RDD[T].  That means you have to add extra type annotations on code 
like the below (which creates a union of RDDs over some subset of paths in a 
folder)

{code}
val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
(rdd, path) ⇒
  rdd.union(sc.textFile(path))
}
{code}

  was:
The return type for SparkContext.emptyRDD is EmptyRDD[T].

It should be RDD[T].  That means you have to add extra type annotations on code 
like the below (which creates a union of RDDs over some subset of paths in a 
folder)

val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
(rdd, path) ⇒
  rdd.union(sc.textFile(path))
}


 SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
 --

 Key: SPARK-2331
 URL: https://issues.apache.org/jira/browse/SPARK-2331
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Ian Hummel
Priority: Minor

 The return type for SparkContext.emptyRDD is EmptyRDD[T].
 It should be RDD[T].  That means you have to add extra type annotations on 
 code like the below (which creates a union of RDDs over some subset of paths 
 in a folder)
 {code}
 val rdds = Seq("a", "b", "c").foldLeft[RDD[String]](sc.emptyRDD[String]) { 
 (rdd, path) ⇒
   rdd.union(sc.textFile(path))
 }
 {code}
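A hedged workaround sketch for the annotation noise described above (assuming the 
spark-shell's {{sc}}; the paths "a", "b", "c" are placeholders): ascribing the broader 
type once lets foldLeft infer RDD[String] without the explicit type parameter.

{code}
import org.apache.spark.rdd.RDD

// Widen the static type of the empty RDD up front.
val empty: RDD[String] = sc.emptyRDD[String]
val rdds = Seq("a", "b", "c").foldLeft(empty) { (rdd, path) =>
  rdd.union(sc.textFile(path))
}
{code}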



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-03-23 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376454#comment-14376454
 ] 

Imran Rashid commented on SPARK-5945:
-

Hi [~ilganeli],

sorry for taking a while to respond.  I think the main issue here is not so 
much implementing the code ([~SuYan] has already shown the small required 
patch).  The big issue is figuring out what the desired semantics are (see the 
questions I listed above), which means getting feedback from all the required 
people.  If you want to drive that process, that sounds great and would really 
be appreciated!

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries, but due to failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still have the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-03-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5945:

Assignee: Ilya Ganelin

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the R in 
 RDD ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries, but due to failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still have the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   //need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6435) spark-shell --jars option does not add all jars to classpath

2015-03-23 Thread vijay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14376501#comment-14376501
 ] 

vijay commented on SPARK-6435:
--

I came up with square brackets after 2 minutes of googling/stackoverflowing; a 
more thorough understanding of bat scripts might yield a better or different 
solution (I can rule myself out of having that deeper bat-script knowledge).  
That said, this test is used to check for an empty string, and square brackets 
are the most upvoted solution: 
http://stackoverflow.com/questions/2541767/what-is-the-proper-way-to-test-if-variable-is-empty-in-a-batch-file-if-not-1


 spark-shell --jars option does not add all jars to classpath
 

 Key: SPARK-6435
 URL: https://issues.apache.org/jira/browse/SPARK-6435
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, Windows
Affects Versions: 1.3.0
 Environment: Win64
Reporter: vijay

 Not all jars supplied via the --jars option will be added to the driver (and 
 presumably executor) classpath.  The first jar(s) will be added, but not all.
 To reproduce this, just add a few jars (I tested 5) to the --jars option, and 
 then try to import a class from the last jar.  This fails.  A simple 
 reproducer: 
 Create a bunch of dummy jars:
 jar cfM jar1.jar log.txt
 jar cfM jar2.jar log.txt
 jar cfM jar3.jar log.txt
 jar cfM jar4.jar log.txt
 Start the spark-shell with the dummy jars and guava at the end:
 %SPARK_HOME%\bin\spark-shell --master local --jars 
 jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar
 In the shell, try importing from guava; you'll get an error:
 {code}
 scala> import com.google.common.base.Strings
 <console>:19: error: object Strings is not a member of package 
 com.google.common.base
import com.google.common.base.Strings
   ^
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled

2015-03-23 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-6443:

Description: 
After digging into the code, I found that users could not submit an app in standalone 
cluster mode when HA is enabled, although client mode works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if the 
problem exists.

3/23 update:
I started a HA cluster with zk, and tried to submit SparkPi example with 
command:
*./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar *

and it failed with error message:
??Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more??

So my guess is right. I will fix it in the related PR.
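A hedged sketch of what the fix conceptually needs (this is not Spark's actual code, 
and the Akka actor path format is an assumption): split the comma-separated HA master 
list and build one master URL per host:port instead of rejecting the whole string.

{code}
// Hypothetical helper: "spark://host1:7077,host2:7077" becomes one Akka URL
// per master rather than a single "invalid" URL.
def toAkkaUrls(sparkUrl: String): Seq[String] =
  sparkUrl.stripPrefix("spark://").split(",").toSeq.map { hostPort =>
    s"akka.tcp://sparkMaster@$hostPort/user/Master"
  }
{code}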



  was:
After digging into the code, I found that users could not submit an app in standalone 
cluster mode when HA is enabled, although client mode works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if the 
problem exists.




 Could not submit app in standalone cluster mode when HA is enabled
 --

 Key: SPARK-6443
 URL: https://issues.apache.org/jira/browse/SPARK-6443
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: Tao Wang

 After digging into the code, I found that users could not submit an app in standalone 
 cluster mode when HA is enabled, although client mode works.
 I haven't tried it yet, but I will verify this and file a PR to resolve it if the 
 problem exists.
 3/23 update:
 I started a HA cluster with zk, and tried to submit SparkPi example with 
 command:
 *./spark-submit  --class org.apache.spark.examples.SparkPi --master 
 spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
 ../lib/spark-examples-1.2.0-hadoop2.4.0.jar *
 and it failed with error message:
 ??Spark assembly has been built with Hive, including Datanucleus jars on 
 classpath
 15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
 spark://doggie153:7077,doggie159:7077
 akka.actor.ActorInitializationException: exception during creation
 at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
 at akka.actor.ActorCell.create(ActorCell.scala:596)
 at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
 at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
 at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://doggie153:7077,doggie159:7077
 at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
 at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
 at 

[jira] [Created] (SPARK-6465) GenericRowWithSchema: KryoException: Class cannot be created (missing no-arg constructor):

2015-03-23 Thread Earthson Lu (JIRA)
Earthson Lu created SPARK-6465:
--

 Summary: GenericRowWithSchema: KryoException: Class cannot be 
created (missing no-arg constructor):
 Key: SPARK-6465
 URL: https://issues.apache.org/jira/browse/SPARK-6465
 Project: Spark
  Issue Type: Bug
  Components: DataFrame
Affects Versions: 1.3.0
 Environment: Spark 1.3, YARN 2.6.0, CentOS
Reporter: Earthson Lu


I cannot find an existing issue for this. 

The registration for GenericRowWithSchema is missing in 
org.apache.spark.sql.execution.SparkSqlSerializer.

Is this the only thing we need to do? (A possible user-side workaround is 
sketched after the log below.)

Here is the log
{code}
15/03/23 16:21:00 WARN TaskSetManager: Lost task 9.0 in stage 20.0 (TID 31978, 
datanode06.site): com.esotericsoftware.kryo.KryoException: Class cannot be 
created (missing no-arg constructor): 
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
at com.esotericsoftware.kryo.Kryo.newInstantiator(Kryo.java:1050)
at com.esotericsoftware.kryo.Kryo.newInstance(Kryo.java:1062)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.create(FieldSerializer.java:228)
at 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:217)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:66)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:217)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}
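Not a confirmed fix for Spark itself, but a hedged user-side workaround sketch (the 
Kryo and Spark APIs used here exist; whether this alone resolves the issue is an 
assumption): register the class with Kryo's Java-serialization fallback, which does 
not need a no-arg constructor, via a custom KryoRegistrator.

{code}
import com.esotericsoftware.kryo.Kryo
import com.esotericsoftware.kryo.serializers.JavaSerializer
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator: route GenericRowWithSchema through Kryo's
// JavaSerializer, which does not require a no-arg constructor.
class RowRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(
      classOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema],
      new JavaSerializer())
  }
}
// Enable it with: --conf spark.kryo.registrator=<fully.qualified.RowRegistrator>
{code}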



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6466) Remove unnecessary attributes when resolving GroupingSets

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375657#comment-14375657
 ] 

Apache Spark commented on SPARK-6466:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5134

 Remove unnecessary attributes when resolving GroupingSets
 -

 Key: SPARK-6466
 URL: https://issues.apache.org/jira/browse/SPARK-6466
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor

 When resolving GroupingSets, we currently list all outputs of GroupingSets's 
 child plan. However, the columns that are not in groupBy expressions and not 
 used by aggregation expressions are unnecessary and can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6467) Override QueryPlan.missingInput when necessary and rely on it in CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-6467:
-

 Summary: Override QueryPlan.missingInput when necessary and rely 
on it in CheckAnalysis
 Key: SPARK-6467
 URL: https://issues.apache.org/jira/browse/SPARK-6467
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Cheng Lian
Priority: Minor


Currently, some LogicalPlans do not override missingInput, but they should. 
As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Description: Virtual columns like GROUPING__ID should never be considered 
as missing input, and thus should be excluded from {{QueryPlan.missingInput}}. 
 (was: Currently, some LogicalPlans do not override missingInput, but they 
should. Then, the lack of proper missingInput implementations leaks to 
CheckAnalysis.)

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Virtual columns like GROUPING__ID should never be considered as missing 
 input, and thus should be excluded from {{QueryPlan.missingInput}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375572#comment-14375572
 ] 

Littlestar commented on SPARK-1480:
---

I hit this bug on Spark 1.3.0 + Mesos 0.21.1, 100% reproducible.

I0323 16:32:18.933440 14504 fetcher.cpp:64] Extracted resource 
'/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S4/frameworks/20150323-152848-1214949568-5050-21134-0009/executors/20150323-100710-1214949568-5050-3453-S4/runs/3d8f22f5-7fed-44ed-b5f9-98a219133911/spark-1.3.0-bin-2.4.0.tar.gz'
 into 
'/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S4/frameworks/20150323-152848-1214949568-5050-21134-0009/executors/20150323-100710-1214949568-5050-3453-S4/runs/3d8f22f5-7fed-44ed-b5f9-98a219133911'
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.executor.MesosExecutorBackend
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.spark.executor.MesosExecutorBackend

 Choose classloader consistently inside of Spark codebase
 

 Key: SPARK-1480
 URL: https://issues.apache.org/jira/browse/SPARK-1480
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 The Spark codebase is not always consistent on which class loader it uses 
 when classlaoders are explicitly passed to things like serializers. This 
 caused SPARK-1403 and also causes a bug where when the driver has a modified 
 context class loader it is not translated correctly in local mode to the 
 (local) executor.
 In most cases what we want is the following behavior:
 1. If there is a context classloader on the thread, use that.
 2. Otherwise use the classloader that loaded Spark.
 We should just have a utility function for this and call that function 
 whenever we need to get a classloader.
 Note that SPARK-1403 is a workaround for this exact problem (it sets the 
 context class loader because downstream code assumes it is set). Once this 
 gets fixed in a more general way SPARK-1403 can be reverted.
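A minimal sketch of the utility function described above (the name is an assumption, 
not Spark's actual API):

{code}
// Prefer the thread's context classloader when one is set; otherwise fall
// back to the classloader that loaded Spark's own classes.
def preferredClassLoader: ClassLoader =
  Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(classOf[org.apache.spark.SparkContext].getClassLoader)
{code}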



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590
 ] 

iward edited comment on SPARK-3720 at 3/23/15 9:10 AM:
---

Hi [~zhzhan], I have the same problem as your issue SPARK-2883. I have only just 
started working with ORC files on Spark and cannot quite understand your patch, so I 
would like to ask you a few questions:
#1, why would Spark read the whole file? What exactly is the problem in Spark?
#2, could you tell me what we should do to solve the problem?
Thanks


was (Author: iward):
Hi [~zhzhan], I have the same problem. I have only just started working with ORC files 
on Spark and cannot quite understand your patch, so I would like to ask you a few 
questions:
#1, why would Spark read the whole file? What exactly is the problem in Spark?
#2, could you tell me what we should do to solve the problem?
Thanks

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Fei Wang
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1 a single file as the output of each task, which reduces the NameNode's load
 2 Hive type support including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3 light-weight indexes stored within the file
 skip row groups that don't pass predicate filtering
 seek to a given row
 4 block-mode compression based on data type
 run-length encoding for integer columns
 dictionary encoding for string columns
 5 concurrent reads of the same file using separate RecordReaders
 6 ability to split files without scanning for markers
 7 bound the amount of memory needed for reading or writing
 8 metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Now spark sql support Parquet, support ORC provide people more opts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6430) Cannot resolve column correctlly when using left semi join

2015-03-23 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375453#comment-14375453
 ] 

Michael Armbrust commented on SPARK-6430:
-

Actually, I might be wrong.  Let me investigate.

 Cannot resolve column correctlly when using left semi join
 --

 Key: SPARK-6430
 URL: https://issues.apache.org/jira/browse/SPARK-6430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3.0 on yarn mode
Reporter: zzc

 My code:
 {quote}
 case class TestData(key: Int, value: String)
 case class TestData2(a: Int, b: Int)
 import org.apache.spark.sql.execution.joins._
 import sqlContext.implicits._
 val testData = sc.parallelize(
 (1 to 100).map(i => TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 val testData2 = sc.parallelize(
   TestData2(1, 1) ::
   TestData2(1, 2) ::
   TestData2(2, 1) ::
   TestData2(2, 2) ::
   TestData2(3, 1) ::
   TestData2(3, 2) :: Nil, 2).toDF()
 testData2.registerTempTable("testData2")
 //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 
 ON key = a ")
 val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM 
 testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by 
 testData2.b")
 tmp.explain()
 {quote}
 Error log:
 {quote}
 org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given 
 input columns key, value; line 1 pos 108
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
 {quote}
 {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is 
 correct, 
 {quote}
 SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a 
 group by testData2.b
 SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 
 ON key = testData2.a group by testData2.b
 {quote} are incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6430) Cannot resolve column correctlly when using left semi join

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reopened SPARK-6430:
-

 Cannot resolve column correctlly when using left semi join
 --

 Key: SPARK-6430
 URL: https://issues.apache.org/jira/browse/SPARK-6430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3.0 on yarn mode
Reporter: zzc

 My code:
 {quote}
 case class TestData(key: Int, value: String)
 case class TestData2(a: Int, b: Int)
 import org.apache.spark.sql.execution.joins._
 import sqlContext.implicits._
 val testData = sc.parallelize(
 (1 to 100).map(i => TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 val testData2 = sc.parallelize(
   TestData2(1, 1) ::
   TestData2(1, 2) ::
   TestData2(2, 1) ::
   TestData2(2, 2) ::
   TestData2(3, 1) ::
   TestData2(3, 2) :: Nil, 2).toDF()
 testData2.registerTempTable("testData2")
 //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 
 ON key = a ")
 val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM 
 testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by 
 testData2.b")
 tmp.explain()
 {quote}
 Error log:
 {quote}
 org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given 
 input columns key, value; line 1 pos 108
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
 {quote}
 {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is 
 correct, 
 {quote}
 SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a 
 group by testData2.b
 SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 
 ON key = testData2.a group by testData2.b
 {quote} are incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)
Littlestar created SPARK-6461:
-

 Summary: spark.executorEnv.PATH in spark-defaults.conf is not pass 
to mesos
 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar


I use Mesos to run Spark 1.3.0 ./run-example SparkPi,
but it failed.

spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
spark.executorEnv.PATH
spark.executorEnv.HADOOP_HOME
spark.executorEnv.JAVA_HOME

E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop fs 
-copyToLocal 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
'/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6430) Cannot resolve column correctlly when using left semi join

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-6430:

Target Version/s: 1.3.1

 Cannot resolve column correctlly when using left semi join
 --

 Key: SPARK-6430
 URL: https://issues.apache.org/jira/browse/SPARK-6430
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
 Environment: Spark 1.3.0 on yarn mode
Reporter: zzc

 My code:
 {quote}
 case class TestData(key: Int, value: String)
 case class TestData2(a: Int, b: Int)
 import org.apache.spark.sql.execution.joins._
 import sqlContext.implicits._
 val testData = sc.parallelize(
 (1 to 100).map(i => TestData(i, i.toString))).toDF()
 testData.registerTempTable("testData")
 val testData2 = sc.parallelize(
   TestData2(1, 1) ::
   TestData2(1, 2) ::
   TestData2(2, 1) ::
   TestData2(2, 2) ::
   TestData2(3, 1) ::
   TestData2(3, 2) :: Nil, 2).toDF()
 testData2.registerTempTable("testData2")
 //val tmp = sqlContext.sql("SELECT * FROM testData *LEFT SEMI JOIN* testData2 
 ON key = a ")
 val tmp = sqlContext.sql("SELECT testData2.b, count(testData2.b) FROM 
 testData *LEFT SEMI JOIN* testData2 ON key = testData2.a group by 
 testData2.b")
 tmp.explain()
 {quote}
 Error log:
 {quote}
 org.apache.spark.sql.AnalysisException: cannot resolve 'testData2.b' given 
 input columns key, value; line 1 pos 108
   at 
 org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
   at 
 org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
   at 
 org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.immutable.List.foreach(List.scala:318)
 {quote}
 {quote}SELECT * FROM testData LEFT SEMI JOIN testData2 ON key = a{quote} is 
 correct, 
 {quote}
 SELECT a FROM testData LEFT SEMI JOIN testData2 ON key = a
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = a group by b
 SELECT max(value) FROM testData LEFT SEMI JOIN testData2 ON key = testData2.a 
 group by testData2.b
 SELECT testData2.b, count(testData2.b) FROM testData LEFT SEMI JOIN testData2 
 ON key = testData2.a group by testData2.b
 {quote} are incorrect.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481
 ] 

Cheng Lian edited comment on SPARK-6456 at 3/23/15 7:21 AM:


How many partitions are there? Also, what's the version of the Hive metastore? 
For now, Spark SQL only supports Hive 0.12.0 and 0.13.1. Spark 1.1 and prior 
versions only support Hive 0.12.0.


was (Author: lian cheng):
How many partitions are there?

 Spark Sql throwing exception on large partitioned data
 --

 Key: SPARK-6456
 URL: https://issues.apache.org/jira/browse/SPARK-6456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: pankaj
 Fix For: 1.2.1


 Spark connects to the Hive metastore. I am able to run simple queries like show 
 tables and select, but it throws the exception below when running a query on a 
 Hive table that has a large number of partitions.
 {noformat}
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at`enter code here` 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.thrift.transport.TTransportException: 
 java.net.SocketTimeoutException: Read timed out
 at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
 at 
 org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 {noformat}
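Not a confirmed fix, but given the SocketTimeoutException in getAllPartitionsOf above, 
one hedged thing to try is raising the Hive metastore client timeout (the property name 
is Hive's; {{hiveContext}} is the assumed HiveContext, and whether this resolves this 
particular case is an assumption):

{code}
// Raise the metastore client socket timeout (an integer number of seconds
// for Hive 0.12/0.13) before touching the heavily partitioned table.
hiveContext.setConf("hive.metastore.client.socket.timeout", "600")
{code}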



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6463) [SPARK][SQL] AttributeSet.equal should compare size

2015-03-23 Thread June (JIRA)
June created SPARK-6463:
---

 Summary: [SPARK][SQL] AttributeSet.equal should compare size
 Key: SPARK-6463
 URL: https://issues.apache.org/jira/browse/SPARK-6463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: June
Priority: Minor


AttributeSet.equal should compare both members and size
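A minimal sketch of the intent (not Spark's actual AttributeSet code; the exact failure 
mode is an assumption): checking membership alone in one direction lets a proper subset 
compare equal, so the size has to be compared as well.

{code}
// With distinct elements, equal sizes plus one-directional containment
// implies the two collections hold exactly the same members.
def sameAttributes[A](a: Seq[A], b: Seq[A]): Boolean =
  a.size == b.size && a.forall(b.contains)
{code}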




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468
 ] 

Littlestar edited comment on SPARK-6461 at 3/23/15 8:39 AM:


Each Mesos slave node has Java and a Hadoop DataNode.

I also added the following settings to mesos-master-env.sh and mesos-slave-env.sh:
 export MESOS_JAVA_HOME=/home/test/jdk
 export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop
 export 
MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin

 /usr/bin/env: bash: No such file or directory

thanks.



was (Author: cnstar9988):
Each Mesos slave node has Java and a Hadoop DataNode.

I also added the following settings to mesos-master-env.sh and mesos-slave-env.sh:
 export MESOS_JAVA_HOME=/home/test/jdk
 export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
 export 
MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin

 /usr/bin/env: bash: No such file or directory

thanks.


 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I use Mesos to run Spark 1.3.0 ./run-example SparkPi,
 but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6463) AttributeSet.equal should compare size

2015-03-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375581#comment-14375581
 ] 

Apache Spark commented on SPARK-6463:
-

User 'sisihj' has created a pull request for this issue:
https://github.com/apache/spark/pull/5133

 AttributeSet.equal should compare size
 --

 Key: SPARK-6463
 URL: https://issues.apache.org/jira/browse/SPARK-6463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: June
Priority: Minor

 AttributeSet.equal should compare both members and size



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580
 ] 

Littlestar edited comment on SPARK-6461 at 3/23/15 8:49 AM:


When I add MESOS_HADOOP_CONF_DIR to all mesos-master-env.sh and 
mesos-slave-env.sh files, it throws the following error.
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/spark/executor/MesosExecutorBackend
 Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.executor.MesosExecutorBackend

similar to https://github.com/apache/spark/pull/620


was (Author: cnstar9988):
When I add MESOS_HADOOP_CONF_DIR to all mesos-master-env.sh and 
mesos-slave-env.sh files, it throws the following error.
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/spark/executor/MesosExecutorBackend
 Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.executor.MesosExecutorBackend

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I use Mesos to run Spark 1.3.0 ./run-example SparkPi,
 but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590
 ] 

iward edited comment on SPARK-3720 at 3/23/15 9:05 AM:
---

Hi [~zhzhan], I have the same problem. I have only just started working with ORC files 
on Spark and cannot quite understand your patch, so I would like to ask you a few 
questions:
#1, why would Spark read the whole file? What exactly is the problem in Spark?
#2, could you tell me what we should do to solve the problem?
Thanks


was (Author: iward):
Hi Zhan Zhang, I have the same problem. I have only just started working with ORC files 
on Spark and cannot quite understand your patch, so I would like to ask you a few 
questions:
#1, why would Spark read the whole file? What exactly is the problem in Spark?
#2, could you tell me what we should do to solve the problem?
Thanks

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Fei Wang
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way 
 to store data on HDFS. The ORC file format has many advantages, such as:
 1 a single file as the output of each task, which reduces the NameNode's load
 2 Hive type support including datetime, decimal, and the complex types 
 (struct, list, map, and union)
 3 light-weight indexes stored within the file
 skip row groups that don't pass predicate filtering
 seek to a given row
 4 block-mode compression based on data type
 run-length encoding for integer columns
 dictionary encoding for string columns
 5 concurrent reads of the same file using separate RecordReaders
 6 ability to split files without scanning for markers
 7 bound the amount of memory needed for reading or writing
 8 metadata stored using Protocol Buffers, which allows addition and removal 
 of fields
 Now spark sql support Parquet, support ORC provide people more opts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599
 ] 

Littlestar commented on SPARK-1702:
---

I met this on Spark 1.3.0 + Mesos 0.21.1

 Mesos executor won't start because of a ClassNotFoundException
 --

 Key: SPARK-1702
 URL: https://issues.apache.org/jira/browse/SPARK-1702
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
  Labels: executors, mesos, spark

 Some discussion here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html
 Fix here (which is probably not the right fix): 
 https://github.com/apache/spark/pull/620
 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again.
 Error in Mesos executor stderr:
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0
 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 
 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j 
 profile: org/apache/spark/log4j-defaults.properties
 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant
 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication 
 enabled: false are ui acls enabled: false users with view permissions: 
 Set(vagrant)
 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started
 14/05/02 17:31:43 INFO Remoting: Starting remoting
 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@localhost:50843]
 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@localhost:50843]
 java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176)
 at org.apache.spark.executor.Executor.<init>(Executor.scala:106)
 at 
 org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56)
 Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] 
 Deactivating the executor libprocess
 The problem is that it can't find the class. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-1702) Mesos executor won't start because of a ClassNotFoundException

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375599#comment-14375599
 ] 

Littlestar edited comment on SPARK-1702 at 3/23/15 9:05 AM:


I met this on Spark 1.3.0 + Mesos 0.21.1 with run-example SparkPi


was (Author: cnstar9988):
I hit this on Spark 1.3.0 + Mesos 0.21.1.

 Mesos executor won't start because of a ClassNotFoundException
 --

 Key: SPARK-1702
 URL: https://issues.apache.org/jira/browse/SPARK-1702
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.0
Reporter: Bouke van der Bijl
  Labels: executors, mesos, spark

 Some discussion here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html
 Fix here (which is probably not the right fix): 
 https://github.com/apache/spark/pull/620
 This was broken in v0.9.0, was fixed in v0.9.1 and is now broken again.
 Error in Mesos executor stderr:
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0502 17:31:42.672224 14688 exec.cpp:131] Version: 0.18.0
 I0502 17:31:42.674959 14707 exec.cpp:205] Executor registered on slave 
 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:42 INFO MesosExecutorBackend: Using Spark's default log4j 
 profile: org/apache/spark/log4j-defaults.properties
 14/05/02 17:31:42 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140501-182306-16842879-5050-10155-0
 14/05/02 17:31:43 INFO SecurityManager: Changing view acls to: vagrant
 14/05/02 17:31:43 INFO SecurityManager: SecurityManager, is authentication 
 enabled: false are ui acls enabled: false users with view permissions: 
 Set(vagrant)
 14/05/02 17:31:43 INFO Slf4jLogger: Slf4jLogger started
 14/05/02 17:31:43 INFO Remoting: Starting remoting
 14/05/02 17:31:43 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@localhost:50843]
 14/05/02 17:31:43 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@localhost:50843]
 java.lang.ClassNotFoundException: org/apache/spark/serializer/JavaSerializer
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at org.apache.spark.SparkEnv$.instantiateClass$1(SparkEnv.scala:165)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:176)
 at org.apache.spark.executor.Executor.init(Executor.scala:106)
 at 
 org.apache.spark.executor.MesosExecutorBackend.registered(MesosExecutorBackend.scala:56)
 Exception in thread Thread-0 I0502 17:31:43.710039 14707 exec.cpp:412] 
 Deactivating the executor libprocess
 The problem is that it can't find the class. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-23 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:

Description: 
Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
In some scenarios, such as the +small and cached rdd+ case mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure that child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees the CPU cores to do other jobs.
In this scenario, our performance improved by 20% compared to before.

  was:
Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
In some scenarios, such as the one I mentioned in the title 


 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus

 Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
 But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
 In some scenarios, such as the +small and cached rdd+ case mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure that child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees the CPU cores to do other jobs.
 In this scenario, our performance improved by 20% compared to before.
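
For illustration, a minimal self-contained Scala sketch of the scenario above. The processCoalesce call is the operation *proposed* by this ticket and appears only as a commented-out hypothetical; everything else uses existing RDD operations.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the locality problem this ticket describes.
object ProcessCoalesceSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("coalesce-locality").setMaster("local[4]"))

  // A small RDD that is cached, so its partitions already live on specific executors.
  val cached = sc.parallelize(1 to 1000, 32).cache()
  cached.count() // materialize the cache

  // Today: coalesce(4) may place a child partition on a different executor than its
  // cached parents, so cached blocks can end up being fetched over the network.
  val fewer = cached.coalesce(4)

  // Proposed (hypothetical): merge only the partitions co-located on each executor,
  // producing one partition per executor and no network transfer.
  // val local = cached.processCoalesce()

  println(fewer.count())
  sc.stop()
}
{code}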



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Summary: Exclude virtual columns from QueryPlan.missingInput  (was: 
Override QueryPlan.missingInput when necessary and rely on CheckAnalysis)

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6451) Support CombineSum in Code Gen

2015-03-23 Thread Venkata Ramana G (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375456#comment-14375456
 ] 

Venkata Ramana G commented on SPARK-6451:
-

Working on the same.

 Support CombineSum in Code Gen
 --

 Key: SPARK-6451
 URL: https://issues.apache.org/jira/browse/SPARK-6451
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Blocker

 Since we are using CombineSum at the reducer side for the SUM function, we 
 need to make it work in code gen. Otherwise, code gen will not convert 
 Aggregates with a SUM function to GeneratedAggregates (the code gen version).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6462) UpdateStateByKey should allow inner join of new with old keys

2015-03-23 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-6462:
---

 Summary: UpdateStateByKey should allow inner join of new with old 
keys
 Key: SPARK-6462
 URL: https://issues.apache.org/jira/browse/SPARK-6462
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.3.0
Reporter: Andre Schumacher



In a nutshell: provide an inner join instead of a cogroup for updateStateByKey in StateDStream.

Details:

It is common to read data (say, weblog data) from a streaming source (say, Kafka) and each time update the state of a relatively small number of keys.

If only the state changes need to be propagated to a downstream sink, then one could avoid filtering out unchanged state in the user program and instead provide this functionality in the API (say, by adding an updateStateChangesByKey method).

Note that this is related but not identical to:
https://issues.apache.org/jira/browse/SPARK-2629
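
For illustration, here is a minimal Scala sketch (assuming a DStream[(String, Long)] of per-key counts) of the workaround users have to write today: updateStateByKey emits the entire state every interval, so unchanged entries must be dropped by hand with a join against the current batch. A built-in variant such as the suggested updateStateChangesByKey would make this unnecessary.
{code}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.dstream.DStream

// A sketch of the current workaround, not the proposed API.
object StateChangesSketch {
  // Running total per key.
  def updateCounts(newValues: Seq[Long], state: Option[Long]): Option[Long] =
    Some(newValues.sum + state.getOrElse(0L))

  // Keep only the keys that actually appeared in the current batch.
  def changedStateOnly(batch: DStream[(String, Long)]): DStream[(String, Long)] = {
    val perBatch = batch.reduceByKey(_ + _)                 // one record per key this interval
    val allState = perBatch.updateStateByKey(updateCounts)  // cogroups with *all* existing keys
    allState.join(perBatch)                                 // inner join drops untouched keys
            .mapValues { case (total, _) => total }
  }
}
{code}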



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1480) Choose classloader consistently inside of Spark codebase

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375576#comment-14375576
 ] 

Littlestar commented on SPARK-1480:
---

same as https://issues.apache.org/jira/browse/SPARK-6461

 run-example SparkPi can reproduce this bug.

 Choose classloader consistently inside of Spark codebase
 

 Key: SPARK-1480
 URL: https://issues.apache.org/jira/browse/SPARK-1480
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 The Spark codebase is not always consistent about which class loader it uses 
 when classloaders are explicitly passed to things like serializers. This 
 caused SPARK-1403 and also causes a bug where, when the driver has a modified 
 context class loader, it is not propagated correctly in local mode to the 
 (local) executor.
 In most cases what we want is the following behavior:
 1. If there is a context classloader on the thread, use that.
 2. Otherwise use the classloader that loaded Spark.
 We should just have a utility function for this and call that function 
 whenever we need to get a classloader.
 Note that SPARK-1403 is a workaround for this exact problem (it sets the 
 context class loader because downstream code assumes it is set). Once this 
 gets fixed in a more general way SPARK-1403 can be reverted.
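
As a rough Scala sketch of that rule (the helper name below is an illustration, not the actual Spark utility):
{code}
// Minimal sketch of the classloader-selection rule described above.
object ClassLoaderRule {
  def preferredClassLoader: ClassLoader =
    Option(Thread.currentThread().getContextClassLoader) // 1. the thread's context classloader, if set
      .getOrElse(getClass.getClassLoader)                 // 2. otherwise the loader that loaded this code (i.e. Spark)
}
{code}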



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3720) support ORC in spark sql

2015-03-23 Thread iward (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375590#comment-14375590
 ] 

iward commented on SPARK-3720:
--

Hi Zhan Zhang, I have the same problem. I have only just started working with ORC files on Spark and I cannot quite understand your patch, so I would like to ask you a few questions:
#1 Why would Spark read the whole files? What is the detail of the problem in Spark?
#2 Could you tell me what we should do to solve the problem?
Thanks

 support ORC in spark sql
 

 Key: SPARK-3720
 URL: https://issues.apache.org/jira/browse/SPARK-3720
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.1.0
Reporter: Fei Wang
 Attachments: orc.diff


 The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as:
 1. a single file as the output of each task, which reduces the NameNode's load
 2. Hive type support, including datetime, decimal, and the complex types (struct, list, map, and union)
 3. light-weight indexes stored within the file
    - skip row groups that don't pass predicate filtering
    - seek to a given row
 4. block-mode compression based on data type
    - run-length encoding for integer columns
    - dictionary encoding for string columns
 5. concurrent reads of the same file using separate RecordReaders
 6. ability to split files without scanning for markers
 7. bound on the amount of memory needed for reading or writing
 8. metadata stored using Protocol Buffers, which allows addition and removal of fields
 Now Spark SQL supports Parquet; supporting ORC as well would give people more options.
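
As a rough sketch of what such support could look like, modeled on the existing parquetFile API of that era; the orcFile call and the file paths below are hypothetical and only illustrate the requested feature.
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Sketch only: orcFile is a hypothetical analog of the existing parquetFile API.
object OrcSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("orc-sketch").setMaster("local[2]"))
  val hiveContext = new HiveContext(sc)

  val people = hiveContext.parquetFile("people.parquet") // works today (placeholder file path)
  // val people = hiveContext.orcFile("people.orc")      // what this ticket asks for (hypothetical)

  people.registerTempTable("people")
  hiveContext.sql("SELECT * FROM people WHERE age > 21").collect().foreach(println)
  sc.stop()
}
{code}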



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468
 ] 

Littlestar commented on SPARK-6461:
---

Each Mesos slave node has Java and a Hadoop DataNode.

I also added the following settings to mesos-master-env.sh and mesos-slave-env.sh:
 export MESOS_JAVA_HOME=/home/test/jdk
 export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
 export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin

It still fails with:
 /usr/bin/env: bash: No such file or directory

Thanks.
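
For reference, the programmatic equivalent of the spark.executorEnv.* entries under discussion is SparkConf.setExecutorEnv; a minimal sketch using the paths quoted above (whether these values actually reach the Mesos executor is exactly what this ticket is about):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the programmatic form of the spark.executorEnv.* settings from spark-defaults.conf.
object ExecutorEnvSketch extends App {
  val conf = new SparkConf()
    .setAppName("SparkPi") // master URL supplied by spark-submit
    .setExecutorEnv("JAVA_HOME", "/home/test/jdk")
    .setExecutorEnv("HADOOP_HOME", "/home/test/hadoop-2.4.0")
    .setExecutorEnv("PATH",
      "/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin")
  val sc = new SparkContext(conf)
  sc.stop()
}
{code}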


 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I used Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6443) Could not submit app in standalone cluster mode when HA is enabled

2015-03-23 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang updated SPARK-6443:

Description: 
After digging into the code, I found that a user could not submit an app in standalone cluster mode when HA is enabled, while client mode works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if the problem exists.

3/23 update:
I started an HA cluster with ZooKeeper and tried to submit the SparkPi example with this command:
./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar 

and it failed with this error message:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more

In client mode, however, it ended with the correct result, so my guess is right. I will fix it in the related PR.
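
For context, the failure above comes from treating the comma-separated HA master list as a single URL. A tiny Scala sketch (an assumption about the shape of a fix, not the actual patch) of splitting it into individual master URLs:
{code}
// Sketch only: split an HA master list such as "spark://doggie153:7077,doggie159:7077"
// into the individual master URLs that the client can try in turn.
object MasterUrlSketch {
  def parseMasterUrls(sparkUrl: String): Seq[String] =
    sparkUrl.stripPrefix("spark://").split(",").toSeq.map(host => s"spark://$host")
}

// MasterUrlSketch.parseMasterUrls("spark://doggie153:7077,doggie159:7077")
//   == Seq("spark://doggie153:7077", "spark://doggie159:7077")
{code}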



  was:
After digging into the code, I found that a user could not submit an app in standalone cluster mode when HA is enabled, while client mode works.

I haven't tried it yet, but I will verify this and file a PR to resolve it if the problem exists.

3/23 update:
I started an HA cluster with ZooKeeper and tried to submit the SparkPi example with this command:
*./spark-submit  --class org.apache.spark.examples.SparkPi --master 
spark://doggie153:7077,doggie159:7077 --deploy-mode cluster 
../lib/spark-examples-1.2.0-hadoop2.4.0.jar *

and it failed with this error message:
??Spark assembly has been built with Hive, including Datanucleus jars on 
classpath
15/03/23 15:24:45 ERROR actor.OneForOneStrategy: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
akka.actor.ActorInitializationException: exception during creation
at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
at akka.actor.ActorCell.create(ActorCell.scala:596)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.spark.SparkException: Invalid master URL: 
spark://doggie153:7077,doggie159:7077
at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
at akka.actor.ActorCell.create(ActorCell.scala:580)
... 9 more??

So my guess is right. I will fix it in the related PR.




 Could not submit app in standalone cluster mode when HA is enabled
 --

 Key: SPARK-6443
 URL: https://issues.apache.org/jira/browse/SPARK-6443
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Reporter: 

[jira] [Commented] (SPARK-1403) Spark on Mesos does not set Thread's context class loader

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375583#comment-14375583
 ] 

Littlestar commented on SPARK-1403:
---

I want to reopen this bug because I can reproduce it on Spark 1.3.0 + Mesos 0.21.1 with run-example SparkPi.


 Spark on Mesos does not set Thread's context class loader
 -

 Key: SPARK-1403
 URL: https://issues.apache.org/jira/browse/SPARK-1403
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: ubuntu 12.04 on vagrant
Reporter: Bharath Bhushan
Priority: Blocker
 Fix For: 1.0.0


 I can run Spark 0.9.0 on Mesos but not Spark 1.0.0. This is because the Spark 
 executor on the Mesos slave throws a java.lang.ClassNotFoundException for 
 org.apache.spark.serializer.JavaSerializer.
 The lengthy discussion is here: 
 http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html#a3513



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-23 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:

Description: 
Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
In some scenarios, such as the one I mentioned in the title 

 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus

 Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
 But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
 In some scenarios, such as the one I mentioned in the title 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6466) Remove unnecessary attributes when resolving GroupingSets

2015-03-23 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-6466:
--

 Summary: Remove unnecessary attributes when resolving GroupingSets
 Key: SPARK-6466
 URL: https://issues.apache.org/jira/browse/SPARK-6466
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor


When resolving GroupingSets, we currently list all outputs of GroupingSets's 
child plan. However, the columns that are not in groupBy expressions and not 
used by aggregation expressions are unnecessary and can be removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Affects Version/s: 1.3.0

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6397) Override QueryPlan.missingInput when necessary and rely on CheckAnalysis

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6397:
--
Assignee: Yadong Qi

 Override QueryPlan.missingInput when necessary and rely on CheckAnalysis
 

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Currently, some LogicalPlans do not override missingInput, but they should. 
 As a result, the lack of proper missingInput implementations leaks into CheckAnalysis.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375580#comment-14375580
 ] 

Littlestar commented on SPARK-6461:
---

When I add MESOS_HADOOP_CONF_DIR to all mesos-master-env.sh and mesos-slave-env.sh files, it throws the following error:
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/spark/executor/MesosExecutorBackend
 Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.executor.MesosExecutorBackend

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I used Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-23 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-6464:
---

 Summary: Add a new transformation of rdd named processCoalesce 
which was  particularly to deal with the small and cached rdd
 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6464) Add a new transformation of rdd named processCoalesce which was particularly to deal with the small and cached rdd

2015-03-23 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-6464:

Description: 
Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
In some scenarios, such as the +small and cached rdd+ case mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure that child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees the CPU cores to do other jobs.
In this scenario, our performance improved by 20% compared to before.


  was:
Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
In some scenarios, such as the +small and cached rdd+ case mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure that child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees the CPU cores to do other jobs.
In this scenario, our performance improved by 20% compared to before.


 Add a new transformation of rdd named processCoalesce which was  particularly 
 to deal with the small and cached rdd
 ---

 Key: SPARK-6464
 URL: https://issues.apache.org/jira/browse/SPARK-6464
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: SaintBacchus

 Nowadays, the *coalesce* transformation is commonly used to increase or reduce the number of partitions in order to get better performance.
 But *coalesce* cannot guarantee that a child partition will be executed on the same executor as its parent partitions, which can lead to a lot of network transfer.
 In some scenarios, such as the +small and cached rdd+ case mentioned in the title, we want to coalesce all the partitions on the same executor into one partition and make sure that child partition is executed on that executor. This avoids network transfer, reduces the number of tasks to schedule, and frees the CPU cores to do other jobs.
 In this scenario, our performance improved by 20% compared to before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375468#comment-14375468
 ] 

Littlestar edited comment on SPARK-6461 at 3/23/15 9:29 AM:


Each Mesos slave node has Java and a Hadoop DataNode.

Now I have added the following settings to mesos-master-env.sh and mesos-slave-env.sh:
 export MESOS_JAVA_HOME=/home/test/jdk
 export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
 export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop
 export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin

It still fails with:
 /usr/bin/env: bash: No such file or directory

Thanks.



was (Author: cnstar9988):
Each Mesos slave node has Java and a Hadoop DataNode.

I also added the following settings to mesos-master-env.sh and mesos-slave-env.sh:
 export MESOS_JAVA_HOME=/home/test/jdk
 export MESOS_HADOOP_HOME=/home/test/hadoop-2.4.0
 export MESOS_HADOOP_CONF_DIR=/home/test/hadoop-2.4.0/etc/hadoop
 export MESOS_PATH=/home/test/jdk/bin:/home/test/hadoop-2.4.0/sbin:/home/test/hadoop-2.4.0/bin:/sbin:/bin:/usr/sbin:/usr/bin

It still fails with:
 /usr/bin/env: bash: No such file or directory

Thanks.


 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I used Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375680#comment-14375680
 ] 

Cheng Lian commented on SPARK-6397:
---

Hey [~smolav], after some discussion with [~waterman] in his PRs, we decided to 
fix the GROUPING__ID virtual column issue first. So I updated the title and 
description of this JIRA ticket, and created SPARK-6467 for the original one. 
You may link your PR to that one. Thanks! I should have created another JIRA 
ticket for the fix introduced in [~waterman]'s PR, but I realized the problem 
too late after merging it.

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor

 Virtual columns like GROUPING__ID should never be considered missing input, 
 and thus should be excluded from {{QueryPlan.missingInput}}.
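
As a simplified Scala illustration of the rule (plain strings stand in for Catalyst attributes; this is not the actual patch):
{code}
// Simplified model: attribute names as plain strings rather than Catalyst Attributes.
object MissingInputSketch {
  val virtualColumns = Set("GROUPING__ID")

  // Attributes referenced but not produced by children, minus known virtual columns.
  def missingInput(references: Set[String], inputSet: Set[String]): Set[String] =
    (references -- inputSet).filterNot(virtualColumns.contains)
}
{code}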



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6397) Exclude virtual columns from QueryPlan.missingInput

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-6397.
---
   Resolution: Fixed
Fix Version/s: 1.4.0
   1.3.1

Issue resolved by pull request 5132
[https://github.com/apache/spark/pull/5132]

 Exclude virtual columns from QueryPlan.missingInput
 ---

 Key: SPARK-6397
 URL: https://issues.apache.org/jira/browse/SPARK-6397
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.0
Reporter: Yadong Qi
Assignee: Yadong Qi
Priority: Minor
 Fix For: 1.3.1, 1.4.0


 Virtual columns like GROUPING__ID should never be considered missing input, 
 and thus should be excluded from {{QueryPlan.missingInput}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6456:
--
Description: 
Spark connects to the Hive metastore. I am able to run simple queries like SHOW TABLES and SELECT, but Spark throws the exception below while running a query on a Hive table that has a large number of partitions.
{noformat}
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at`enter code here` 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.transport.TTransportException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
at 
org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
{noformat}

  was:
Observation:
Spark connects to the Hive metastore. I am able to run simple queries like SHOW TABLES and SELECT, but Spark throws the exception below while running a query on a Hive table that has a large number of partitions.

{code}
Exception in thread main java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
at`enter code here` 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
org.apache.thrift.transport.TTransportException: 
java.net.SocketTimeoutException: Read timed out
at 
org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
at 
org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
at 

[jira] [Commented] (SPARK-6456) Spark Sql throwing exception on large partitioned data

2015-03-23 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375481#comment-14375481
 ] 

Cheng Lian commented on SPARK-6456:
---

How many partitions are there?

 Spark Sql throwing exception on large partitioned data
 --

 Key: SPARK-6456
 URL: https://issues.apache.org/jira/browse/SPARK-6456
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: pankaj
 Fix For: 1.2.1


 Spark connects to the Hive metastore. I am able to run simple queries like SHOW TABLES and SELECT, but Spark throws the exception below while running a query on a Hive table that has a large number of partitions.
 {noformat}
 Exception in thread main java.lang.reflect.InvocationTargetException
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at 
 org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:40)
 at`enter code here` 
 org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
 org.apache.thrift.transport.TTransportException: 
 java.net.SocketTimeoutException: Read timed out
 at 
 org.apache.hadoop.hive.ql.metadata.Hive.getAllPartitionsOf(Hive.java:1785)
 at 
 org.apache.spark.sql.hive.HiveShim$.getAllPartitionsOf(Shim13.scala:316)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:86)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:137)
 at scala.Option.getOrElse(Option.scala:120)
 at 
 org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:137)
 at 
 org.apache.spark.sql.hive.HiveContext$$anon$1.lookupRelation(HiveContext.scala:253)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:143)
 at 
 org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$5.applyOrElse(Analyzer.scala:138)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144)
 at 
 org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:162)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
 at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5320) Joins on simple table created using select gives error

2015-03-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5320:

Assignee: Yuri Saito

 Joins on simple table created using select gives error
 --

 Key: SPARK-5320
 URL: https://issues.apache.org/jira/browse/SPARK-5320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
Reporter: Kuldeep
Assignee: Yuri Saito
 Fix For: 1.3.1, 1.4.0


 Register "select 0 as a, 1 as b" as table zeroone.
 Register "select 0 as x, 1 as y" as table zeroone2.
 The following SQL
 select * from zeroone ta join zeroone2 tb on ta.a = tb.x
 gives the error
 java.lang.UnsupportedOperationException: LeafNode NoRelation$ must implement statistics.
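
A minimal, self-contained reproduction of the report above might look like this (local mode assumed; table registration via registerTempTable as in Spark 1.1+):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Sketch: reproduce the join failure on tables registered from FROM-less selects.
object JoinReproSketch extends App {
  val sc = new SparkContext(new SparkConf().setAppName("SPARK-5320-repro").setMaster("local[2]"))
  val sqlContext = new SQLContext(sc)

  sqlContext.sql("select 0 as a, 1 as b").registerTempTable("zeroone")
  sqlContext.sql("select 0 as x, 1 as y").registerTempTable("zeroone2")

  // Reported to fail with: java.lang.UnsupportedOperationException:
  //   LeafNode NoRelation$ must implement statistics.
  sqlContext.sql("select * from zeroone ta join zeroone2 tb on ta.a = tb.x").collect().foreach(println)

  sc.stop()
}
{code}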



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6461) spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos

2015-03-23 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14375478#comment-14375478
 ] 

Littlestar commented on SPARK-6461:
---

In spark/bin, some shell scripts use #!/usr/bin/env bash.

I changed #!/usr/bin/env bash to #!/bin/bash and that worked.

 spark.executorEnv.PATH in spark-defaults.conf is not pass to mesos
 --

 Key: SPARK-6461
 URL: https://issues.apache.org/jira/browse/SPARK-6461
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.3.0
Reporter: Littlestar

 I used Mesos to run Spark 1.3.0 ./run-example SparkPi, but it failed.
 spark.executorEnv.PATH in spark-defaults.conf is not passed to Mesos:
 spark.executorEnv.PATH
 spark.executorEnv.HADOOP_HOME
 spark.executorEnv.JAVA_HOME
 E0323 14:24:36.400635 11355 fetcher.cpp:109] HDFS copyToLocal failed: hadoop 
 fs -copyToLocal 
 'hdfs://192.168.1.9:54310/home/test/spark-1.3.0-bin-2.4.0.tar.gz' 
 '/home/mesos/work_dir/slaves/20150323-100710-1214949568-5050-3453-S3/frameworks/20150323-133400-1214949568-5050-15440-0007/executors/20150323-100710-1214949568-5050-3453-S3/runs/915b40d8-f7c4-428a-9df8-ac9804c6cd21/spark-1.3.0-bin-2.4.0.tar.gz'
 sh: hadoop: command not found



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


