[jira] [Comment Edited] (SPARK-15227) InputStream stop-start semantics + empty implementations

2016-05-16 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286089#comment-15286089
 ] 

Prashant Sharma edited comment on SPARK-15227 at 5/17/16 5:44 AM:
--

If start and stop are overridden by a particular DStream, they are called when 
the stream is started (to do some initialization) and stopped (to do some 
cleanup). However, if there is nothing to initialize or clean up, they can be 
left empty. 

Pause and resume are very different from start and stop. For example, if you 
pause, what happens to the incoming stream: is it buffered or is it dropped? 
Those semantics need to be discussed before we can talk about that. It is 
possible to implement this with a custom receiver.

I am not sure; since development efforts have shifted towards Structured 
Streaming, it will be interesting to see how this sort of thing gets 
implemented.
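
To illustrate the custom receiver route mentioned above, here is a minimal sketch, assuming the standard {{org.apache.spark.streaming.receiver.Receiver}} API; the class name, the socket source and the pause flag are purely illustrative:

{code}
import java.util.concurrent.atomic.AtomicBoolean

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative only: a receiver whose incoming data can be "paused" (dropped)
// without stopping the receiver itself.
class PausableSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  private val paused = new AtomicBoolean(false)

  def pause(): Unit = paused.set(true)
  def resume(): Unit = paused.set(false)

  // Called by Spark when the receiver is started: launch the reading thread.
  override def onStart(): Unit = {
    new Thread("pausable-receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // Called by Spark when the receiver is stopped: nothing to clean up here,
  // the reading thread checks isStopped() and exits on its own.
  override def onStop(): Unit = {}

  private def receive(): Unit = {
    val socket = new java.net.Socket(host, port)
    val reader = new java.io.BufferedReader(
      new java.io.InputStreamReader(socket.getInputStream))
    try {
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        // Dropping data while paused; buffering instead would be the other
        // possible semantics discussed above.
        if (!paused.get()) store(line)
        line = reader.readLine()
      }
    } finally {
      reader.close()
      socket.close()
    }
  }
}
{code}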




was (Author: prashant_):
If start and stop are overridden by a particular DStream, they are called when 
the streams are started(to do some intialization) and stopped (to do so 
cleanup). However, if there is nothing to initialize and cleanup - then they 
can be left empty. 

Pause and resume is very different from start and stop. For example, if you 
pause - what happens to the incoming stream. They are buffered or they are 
dropped ? Those semantics need to be discussed, before we can talk about that. 
It is possible to implement it by having a custom receiver.

I am not sure, but since the development efforts are shifted towards the 
structured streaming, it will be interesting to see - how this sort of thing 
gets implemented.



> InputStream stop-start semantics + empty implementations
> 
>
> Key: SPARK-15227
> URL: https://issues.apache.org/jira/browse/SPARK-15227
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Streaming
>Affects Versions: 1.6.1
>Reporter: Stas Levin
>Priority: Minor
>
> Hi,
> Seems like quite a few InputStream(s) currently leave the start and stop 
> methods empty. 
> I was hoping to hear your thoughts on:
> 1. Whether there were any particular reasons to leave these methods empty ?
> 2. Do the stop/start semantics of InputStream(s) aim to support pause-resume 
> use-cases, or is it a one way ticket? 
> A pause-resume kind of thing could be really useful for cases where one 
> wishes to load new offline data for the streaming app to run on top of, 
> without restarting the entire app.
> Thanks a lot,
> Stas



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15227) InputStream stop-start semantics + empty implementations

2016-05-16 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286089#comment-15286089
 ] 

Prashant Sharma commented on SPARK-15227:
-

If start and stop are overridden by a particular DStream, they are called when 
the stream is started (to do some initialization) and stopped (to do some 
cleanup). However, if there is nothing to initialize or clean up, they can be 
left empty. 

Pause and resume are very different from start and stop. For example, if you 
pause, what happens to the incoming stream: is it buffered or is it dropped? 
Those semantics need to be discussed before we can talk about that. It is 
possible to implement this with a custom receiver.

I am not sure, but since development efforts have shifted towards Structured 
Streaming, it will be interesting to see how this sort of thing gets 
implemented.



> InputStream stop-start semantics + empty implementations
> 
>
> Key: SPARK-15227
> URL: https://issues.apache.org/jira/browse/SPARK-15227
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output, Streaming
>Affects Versions: 1.6.1
>Reporter: Stas Levin
>Priority: Minor
>
> Hi,
> Seems like quite a few InputStream(s) currently leave the start and stop 
> methods empty. 
> I was hoping to hear your thoughts on:
> 1. Whether there were any particular reasons to leave these methods empty ?
> 2. Do the stop/start semantics of InputStream(s) aim to support pause-resume 
> use-cases, or is it a one way ticket? 
> A pause-resume kind of thing could be really useful for cases where one 
> wishes to load new offline data for the streaming app to run on top of, 
> without restarting the entire app.
> Thanks a lot,
> Stas



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286039#comment-15286039
 ] 

Felix Cheung commented on SPARK-15344:
--

This was the original change: https://issues.apache.org/jira/browse/SPARK-11929


> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set default log level for Pyspark.
> It's always WARN.
> Below setting doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286038#comment-15286038
 ] 

Felix Cheung commented on SPARK-15344:
--

SPARK-14881 was to get pyspark and sparkR shell to match the new default 
behavior of spark-shell (Scala). 
As you can see here, it will always set the default to WARN: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L135

I agree it makes sense that if log4j-defaults.properties is present, we should 
keep the log level set there, for all shell/REPL cases.
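
As a side note, the effective level can still be overridden explicitly at runtime from the Scala or Python shells; a minimal example (the same {{setLogLevel}} method also exists on the PySpark SparkContext):

{code}
// In spark-shell: override whatever default the REPL applied at startup.
sc.setLogLevel("INFO")
{code}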

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set default log level for Pyspark.
> It's always WARN.
> Below setting doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13850) TimSort Comparison method violates its general contract

2016-05-16 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285980#comment-15285980
 ] 

Yin Huai commented on SPARK-13850:
--

Can you explain the root cause here?

> TimSort Comparison method violates its general contract
> ---
>
> Key: SPARK-13850
> URL: https://issues.apache.org/jira/browse/SPARK-13850
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Sital Kedia
>
> While running a query which does a group by on a large dataset, the query 
> fails with following stack trace. 
> {code}
> Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most 
> recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, 
> hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison 
> method violates its general contract!
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453)
>   at 
> org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Please note that the same query used to succeed in Spark 1.5 so it seems like 
> a regression in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15292) ML 2.0 QA: Scala APIs audit for classification

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15292:
--
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> ML 2.0 QA: Scala APIs audit for classification
> --
>
> Key: SPARK-15292
> URL: https://issues.apache.org/jira/browse/SPARK-15292
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Audit Scala API for classification; almost all issues were related to 
> MultilayerPerceptronClassifier:
> * Fix one wrong param getter/setter method: getOptimizer -> getSolver
> * Add missing setter for "solver" and "stepSize".
> * Make GD solver take effect.
> * Update docs, annotations and fix other minor issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15269) Creating external table leaves empty directory under warehouse directory

2016-05-16 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15269:
---
Assignee: Xin Wu

> Creating external table leaves empty directory under warehouse directory
> 
>
> Key: SPARK-15269
> URL: https://issues.apache.org/jira/browse/SPARK-15269
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Xin Wu
>
> Adding the following test case in {{HiveDDLSuite}} may reproduce this issue:
> {code}
>   test("foo") {
> withTempPath { dir =>
>   val path = dir.getCanonicalPath
>   spark.range(1).write.json(path)
>   withTable("ddl_test1") {
> sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')")
> sql("DROP TABLE ddl_test1")
> sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a")
>   }
> }
>   }
> {code}
> Note that the first {{CREATE TABLE}} command creates an external table since 
> data source tables are always external when {{PATH}} option is specified.
> When executing the second {{CREATE TABLE}} command, which creates a managed 
> table with the same name, it fails because there's already an unexpected 
> directory with the same name as the table name in the warehouse directory:
> {noformat}
> [info] - foo *** FAILED *** (7 seconds, 649 milliseconds)
> [info]   org.apache.spark.sql.AnalysisException: path 
> file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1
>  already exists.;
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417)
> [info]   at 
> org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> [info]   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> [info]   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> [info]   at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:186)
> [info]   at org.apache.spark.sql.Dataset.(Dataset.scala:167)
> [info]   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
> [info]   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59)
> [info]   at 
> 

[jira] [Updated] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-16 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-15357:
--
Description: 
In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch(...) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.

  was:
In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        logger.debug("Task {} released {} from {} for {}", taskAttemptId,
          Utils.bytesToString(released), c, consumer);
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (IOException e) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.


> Cooperative spilling should check consumer memory mode
> --
>
> Key: SPARK-15357
> URL: https://issues.apache.org/jira/browse/SPARK-15357
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>
> In TaskMemoryManager.java:
> {code}
> for (MemoryConsumer c: consumers) {
>   if (c != consumer && c.getUsed() > 0) {
>     try {
>       long released = c.spill(required - got, consumer);
>       if (released > 0 && mode == tungstenMemoryMode) {
>         got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
>         if (got >= required) {
>           break;
>         }
>       }
>     } catch(...) { ... }
>   }
> }
> {code}
> Currently, when non-tungsten consumers acquire execution memory, they may 
> force other tungsten consumers to spill and then NOT use the freed memory. A 
> better way to do this is to incorporate the memory mode in the consumer 
> itself and spill only those with matching memory modes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15357) Cooperative spilling should check consumer memory mode

2016-05-16 Thread Andrew Or (JIRA)
Andrew Or created SPARK-15357:
-

 Summary: Cooperative spilling should check consumer memory mode
 Key: SPARK-15357
 URL: https://issues.apache.org/jira/browse/SPARK-15357
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Andrew Or


In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        logger.debug("Task {} released {} from {} for {}", taskAttemptId,
          Utils.bytesToString(released), c, consumer);
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (IOException e) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force 
other tungsten consumers to spill and then NOT use the freed memory. A better 
way to do this is to incorporate the memory mode in the consumer itself and 
spill only those with matching memory modes.
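
A rough sketch of the intended rule, written as standalone Scala for illustration (these are not Spark's actual classes; TaskMemoryManager itself is Java):

{code}
// Standalone sketch (not Spark's actual classes): models the proposed rule that
// cooperative spilling should only force consumers in the same memory mode to spill.
object CooperativeSpillSketch {

  object MemoryMode extends Enumeration { val ON_HEAP, OFF_HEAP = Value }

  trait Consumer {
    def mode: MemoryMode.Value
    def used: Long
    def spill(required: Long): Long  // returns the number of bytes actually freed
  }

  def spillForMode(consumers: Seq[Consumer], requesting: Consumer, required: Long): Long = {
    var freed = 0L
    for (c <- consumers if c ne requesting) {
      // Only spill consumers whose memory mode matches the requesting consumer's,
      // so the freed memory can actually satisfy the request.
      if (freed < required && c.used > 0 && c.mode == requesting.mode) {
        freed += c.spill(required - freed)
      }
    }
    freed
  }
}
{code}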



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285675#comment-15285675
 ] 

Apache Spark commented on SPARK-14752:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13141

> LazilyGenerateOrdering throws NullPointerException
> --
>
> Key: SPARK-14752
> URL: https://issues.apache.org/jira/browse/SPARK-14752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> codebase: spark master
> DataSet: TPC-DS
> Client: $SPARK_HOME/bin/beeline
> Example query to reproduce the issue:  
> select i_item_id from item order by i_item_id limit 10;
> Explain plan output
> {noformat}
> explain select i_item_id from item order by i_item_id limit 10;
> +--+--+
> | 
> plan  
>   
>  |
> +--+--+
> | == Physical Plan ==
> TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], 
> output=[i_item_id#1229])
> +- WholeStageCodegen
>:  +- Project [i_item_id#1229]
>: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], 
> ReadSchema: struct  |
> +--+--+
> {noformat}
> Exception:
> {noformat}
> TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> underlying (org.apache.spark.util.BoundedPriorityQueue)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25)
>   at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19)
>   at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312)
>   at 
> org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791)
>   at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148)
>   at scala.math.Ordering$$anon$4.compare(Ordering.scala:111)
>   at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669)
>   at java.util.PriorityQueue.siftUp(PriorityQueue.java:645)
>   at java.util.PriorityQueue.offer(PriorityQueue.java:344)
>   at java.util.PriorityQueue.add(PriorityQueue.java:321)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78)
>   at 
> com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31)
>   at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708)
>   at 
> 

[jira] [Commented] (SPARK-14817) ML, Graph, R 2.0 QA: Programming guide update and migration guide

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285626#comment-15285626
 ] 

Joseph K. Bradley commented on SPARK-14817:
---

Migration guide needs to note change from [SPARK-14814]'s PR

> ML, Graph, R 2.0 QA: Programming guide update and migration guide
> -
>
> Key: SPARK-14817
> URL: https://issues.apache.org/jira/browse/SPARK-14817
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> Before the release, we need to update the MLlib, GraphX, and SparkR 
> Programming Guides.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-13448].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> For MLlib, we will make the DataFrame-based API (spark.ml) front-and-center, 
> to make it clear the RDD-based API is the older, maintenance-mode one.
> * No docs for spark.mllib will be deleted; they will just be reorganized and 
> put in a subsection.
> * If spark.ml docs are less complete, or if spark.ml docs say "refer to the 
> spark.mllib docs for details," then we should copy those details to the 
> spark.ml docs.  This per-feature work can happen under [SPARK-14815].
> * This big reorganization should be done *after* docs are added for each 
> feature (to minimize merge conflicts).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14814) ML 2.0 QA: API: Java compatibility, docs

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14814.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Given your review + the Java fix, I'll mark this as done.  Thanks!

> ML 2.0 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-14814
> URL: https://issues.apache.org/jira/browse/SPARK-14814
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.0.0
>
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-16 Thread praveen dareddy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285618#comment-15285618
 ] 

praveen dareddy commented on SPARK-15194:
-

[~josephkb] Thanks for clarifying this.
I will continue working on this issue once the blocker issue SPARK-14906 is 
merged to master.

Thanks,
praveen

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLLib version but not the ML version. This 
> would allow Python's  `GaussianMixture` to more closely match the Scala API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285613#comment-15285613
 ] 

Joseph K. Bradley commented on SPARK-14810:
---

[~nick.pentre...@gmail.com] Thanks!  Your judgements sound correct to me.

To document the changes, I like to list them in the migration guide, grouped by 
whether they are breaking changes, removed deprecated items, behavior changes, 
etc.

By the way, can you please not put items specific to this release in the JIRA 
description?  It makes things easier if we can clone these QA JIRAs for each 
new release and minimize the editing needed.  Feel free to update the 
instructions, though.

> ML, Graph 2.0 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-14810
> URL: https://issues.apache.org/jira/browse/SPARK-14810
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Nick Pentreath
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.
> List of changes since {{1.6.0}} audited - these are "false positives" due to 
> being private, @Experimental, DeveloperAPI, etc:
> * SPARK-13686 - Add a constructor parameter `regParam` to 
> (Streaming)LinearRegressionWithSGD
> * SPARK-13664 - Replace HadoopFsRelation with FileFormat
> * SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add 
> LibSVMOutputWriter
> * SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI 
> APIs
> * SPARK-11011 - UserDefinedType serialization should be strongly typed
> * SPARK-13817 - Re-enable MiMA and removes object DataFrame
> * SPARK-13927 - add row/column iterator to local matrices - (add methods to 
> sealed trait)
> * SPARK-13948 - MiMa Check should catch if the visibility change to `private` 
> - (DataFrame -> Dataset)
> * SPARK-11262 - Unit test for gradient, loss layers, memory management - 
> (private class)
> * SPARK-13430 - moved featureCol from LinearRegressionModelSummary to 
> LinearRegressionSummary - (private class)
> * SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private 
> class)
> * SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - 
> (private methods added)
> * SPARK-14861 - Replace internal usages of SQLContext with SparkSession - 
> (private class)
> Binary incompatible changes:
> * SPARK-14089 - Remove methods that has been deprecated since 1.1, 1.2, 1.3, 
> 1.4, and 1.5 
> * SPARK-14952 - Remove methods deprecated in 1.6
> * DataFrame -> Dataset changes for Java (this of course applies for all 
> of Spark SQL)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285608#comment-15285608
 ] 

Joseph K. Bradley commented on SPARK-7424:
--

I'm retargeting for 2.1 since we need to focus on QA now.

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.
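> For illustration, one way the prediction column metadata could carry numClasses 
> is via the ML attribute API (a sketch; the column name and the {{numClasses}} 
> value here are arbitrary):
> {code}
> import org.apache.spark.ml.attribute.NominalAttribute
>
> // Sketch: build column metadata that records the number of classes, to be
> // attached when the prediction column is produced.
> val numClasses = 3  // e.g. taken from the fitted model
> val predictionMetadata = NominalAttribute.defaultAttr
>   .withName("prediction")
>   .withNumValues(numClasses)
>   .toMetadata()
>
> // e.g. dataset.withColumn("prediction", predictUDF.as("prediction", predictionMetadata))
> {code}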



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7424:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Deleted] (SPARK-15356) AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number

2016-05-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin deleted SPARK-15356:
---


>  AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech 
> Support Phone Number 
> 
>
> Key: SPARK-15356
> URL: https://issues.apache.org/jira/browse/SPARK-15356
> Project: Spark
>  Issue Type: Bug
> Environment:  AOL Customer Care Number @ 1800.545.7482 Help Desk 
> Number & AOL MAIL Tech Support Phone Number 
>Reporter: lola pola
>
> Support & Service Call -)) 1800 545 7482 )))AOL Tech support phone number  
> AOL support Phone number  %%2$$ AOL customer support phone number 
> +1800-545-7482  AOL customer service number,1800^545^7482  AOL helpdesk phone 
> number, AOL customer care number, 1800*545*7482 AOL support phone number, 
> AOL password recovery phone number, 1800::545::7482 AOL customer care  phone 
> number, AOL customer service  number AOL official phone number 
> 1800**545**7482 @$@$
> +1800 545 7482 AOL EMAIL SUPPORT NUMBER 1800 545 7482 AOL customer care 
> number AOL support phone number 1800 545 7482 AOL customer care number 1800 
> 545 7482 AOL helpdesk phone number 1800 545 7482 AOL EMAIL SUPPORT HELPDESK 
> AOL Email helpdesk number AOL Password recovery phone number AOL tech support 
> number 1800 545 7482 AOL Technical support number
> 1800-545-7482 AOL EMAIL SUPPORT NUMBER 1800-545-7482 AOL customer care number 
> AOL support phone number 1800-545-7482 AOL customer care number 1800-545-7482 
> AOL helpdesk phone number 1800-545-7482 AOL EMAIL SUPPORT HELPDESK AOL Email 
> helpdesk number AOL Password recovery phone number AOL tech support number 
> 1800-545-7482 AOL Technical support number @@/CANADA 

[jira] [Updated] (SPARK-15328) Word2Vec import for original binary format

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15328:
--
Priority: Minor  (was: Major)

> Word2Vec import for original binary format
> --
>
> Key: SPARK-15328
> URL: https://issues.apache.org/jira/browse/SPARK-15328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15328) Word2Vec import for original binary format

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15328:
--
Component/s: (was: MLlib)

> Word2Vec import for original binary format
> --
>
> Key: SPARK-15328
> URL: https://issues.apache.org/jira/browse/SPARK-15328
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Yuming Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15356) AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number

2016-05-16 Thread lola pola (JIRA)
lola pola created SPARK-15356:
-

 Summary:  AOL Customer Care Number @ 1800.545.7482 Help Desk 
Number & AOL MAIL Tech Support Phone Number 
 Key: SPARK-15356
 URL: https://issues.apache.org/jira/browse/SPARK-15356
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.6.1
 Environment:  AOL Customer Care Number @ 1800.545.7482 Help Desk 
Number & AOL MAIL Tech Support Phone Number 
Reporter: lola pola


Support & Service Call -)) 1800 545 7482 )))AOL Tech support phone number  AOL 
support Phone number  %%2$$ AOL customer support phone number +1800-545-7482  
AOL customer service number,1800^545^7482  AOL helpdesk phone number, AOL 
customer care number, 1800*545*7482 AOL support phone number, AOL password 
recovery phone number, 1800::545::7482 AOL customer care  phone number, AOL 
customer service  number AOL official phone number 1800**545**7482 @$@$


+1800 545 7482 AOL EMAIL SUPPORT NUMBER 1800 545 7482 AOL customer care number 
AOL support phone number 1800 545 7482 AOL customer care number 1800 545 7482 
AOL helpdesk phone number 1800 545 7482 AOL EMAIL SUPPORT HELPDESK AOL Email 
helpdesk number AOL Password recovery phone number AOL tech support number 1800 
545 7482 AOL Technical support number
1800-545-7482 AOL EMAIL SUPPORT NUMBER 1800-545-7482 AOL customer care number 
AOL support phone number 1800-545-7482 AOL customer care number 1800-545-7482 
AOL helpdesk phone number 1800-545-7482 AOL EMAIL SUPPORT HELPDESK AOL Email 
helpdesk number AOL Password recovery phone number AOL tech support number 
1800-545-7482 AOL Technical support number @@/CANADA 

[jira] [Updated] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15254:
--
Component/s: Documentation

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse; we should 
> fill it out with a more concrete description.
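> For example, the improved doc could include a minimal end-to-end snippet along 
> these lines (a sketch; {{training}} is an assumed DataFrame with "label" and 
> "features" columns):
> {code}
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
>
> // 3-fold cross validation over a small regularization grid.
> val lr = new LogisticRegression().setMaxIter(10)
> val paramGrid = new ParamGridBuilder()
>   .addGrid(lr.regParam, Array(0.1, 0.01))
>   .build()
>
> val cv = new CrossValidator()
>   .setEstimator(lr)
>   .setEvaluator(new BinaryClassificationEvaluator())
>   .setEstimatorParamMaps(paramGrid)
>   .setNumFolds(3)
>
> val cvModel = cv.fit(training)  // `training` is the assumed input DataFrame
> {code}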



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15254:
--
Issue Type: Documentation  (was: Improvement)

> Improve ML pipeline Cross Validation Scaladoc & PyDoc
> -
>
> Key: SPARK-15254
> URL: https://issues.apache.org/jira/browse/SPARK-15254
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: holdenk
>Priority: Minor
>
> The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse; we should 
> fill it out with a more concrete description.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285595#comment-15285595
 ] 

Joseph K. Bradley commented on SPARK-15194:
---

This should be implemented using numpy, within mllib-local, as [~holdenk] said. 
 But you'll need to wait until the blocker JIRA is done.

> Add Python ML API for MultivariateGaussian
> --
>
> Key: SPARK-15194
> URL: https://issues.apache.org/jira/browse/SPARK-15194
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Minor
>
> We have a PySpark API for the MLLib version but not the ML version. This 
> would allow Python's  `GaussianMixture` to more closely match the Scala API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala

2016-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15164:
--
Target Version/s: 2.0.0

> Mark classification algorithms as experimental where marked so in scala
> ---
>
> Key: SPARK-15164
> URL: https://issues.apache.org/jira/browse/SPARK-15164
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15145) port binary classification evaluator to spark.ml

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285577#comment-15285577
 ] 

Joseph K. Bradley commented on SPARK-15145:
---

[~wm624] Can you please update this JIRA title and description?  (The evaluator 
already is in spark.ml; this needs to be more specific.)  Also, please update 
the PR.  Thanks!

> port binary classification evaluator to spark.ml
> 
>
> Key: SPARK-15145
> URL: https://issues.apache.org/jira/browse/SPARK-15145
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Miao Wang
>
> As we discussed in #12922, binary classification evaluator should be ported 
> from mllib to spark.ml after 2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15145) port binary classification evaluator to spark.ml

2016-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285577#comment-15285577
 ] 

Joseph K. Bradley edited comment on SPARK-15145 at 5/16/16 10:45 PM:
-

[~wm624] Can you please update this JIRA title and description?  (The evaluator 
already is in spark.ml; this needs to be more specific.)  Also, please update 
the PR title & description too.  Thanks!


was (Author: josephkb):
[~wm624] Can you please update this JIRA title and description?  (The evaluator 
already is in spark.ml; this needs to be more specific.)  Also, please update 
the PR.  Thanks!

> port binary classification evaluator to spark.ml
> 
>
> Key: SPARK-15145
> URL: https://issues.apache.org/jira/browse/SPARK-15145
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Miao Wang
>
> As we discussed in #12922, binary classification evaluator should be ported 
> from mllib to spark.ml after 2.0 release.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15355) Pro-active block replenishment in case of node/executor failures

2016-05-16 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-15355:
--

 Summary: Pro-active block replenishment in case of node/executor 
failures
 Key: SPARK-15355
 URL: https://issues.apache.org/jira/browse/SPARK-15355
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Spark Core
Reporter: Shubham Chopra


Spark currently does not replenish lost replicas. For resiliency and high 
availability, BlockManagerMasterEndpoint can proactively verify whether all 
cached RDDs have enough replicas and replenish them when they don't.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15354) Topology aware block replication strategies

2016-05-16 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-15354:
--

 Summary: Topology aware block replication strategies
 Key: SPARK-15354
 URL: https://issues.apache.org/jira/browse/SPARK-15354
 Project: Spark
  Issue Type: Sub-task
  Components: Mesos, Spark Core, YARN
Reporter: Shubham Chopra


Implement resilient block replication strategies for different resource 
managers, mirroring the 3-replica strategy used by HDFS: the first replica is 
placed on an executor, the second within the same rack as that executor, and 
the third on a different rack. 
The implementation involves providing two pluggable classes: one running in the 
driver that provides topology information for every host at cluster start, and 
a second that prioritizes a list of peer BlockManagerIds. 
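A rough REPL-style sketch of the rack-aware prioritization described above ({{Peer}} is a simplified stand-in for BlockManagerId; names are assumptions, not the eventual API): peers in the writer's rack come first, so the second replica stays in-rack and the third lands on a different rack.
{code}
import scala.util.Random

case class Peer(host: String, rack: Option[String]) // simplified stand-in for BlockManagerId

// Order candidate peers so an in-rack peer is tried first (second replica),
// followed by peers on other racks (third replica).
def prioritize(selfRack: Option[String], peers: Seq[Peer]): Seq[Peer] = {
  val (sameRack, otherRacks) = peers.partition(p => p.rack.isDefined && p.rack == selfRack)
  Random.shuffle(sameRack) ++ Random.shuffle(otherRacks)
}
{code}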




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15353) Making peer selection for block replication pluggable

2016-05-16 Thread Shubham Chopra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Chopra updated SPARK-15353:
---
Attachment: BlockManagerSequenceDiagram.png

Sequence diagram explaining the various calls between BlockManager and 
BlockManagerMasterEndpoint for topology aware block replication

> Making peer selection for block replication pluggable
> -
>
> Key: SPARK-15353
> URL: https://issues.apache.org/jira/browse/SPARK-15353
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Shubham Chopra
> Attachments: BlockManagerSequenceDiagram.png
>
>
> BlockManagers running on executors provide all logistics around block 
> management. Before a BlockManager can be used, it has to be “initialized”. As 
> a part of the initialization, BlockManager asks the 
> BlockManagerMasterEndpoint to give it topology information. The 
> BlockManagerMasterEndpoint is provided a pluggable interface that can be used 
> to resolve a hostname to topology. This information is used to decorate the 
> BlockManagerId. This happens at cluster start and whenever a new executor is 
> added.
> During replication, the BlockManager gets the list of all its peers in the 
> form of a Seq[BlockManagerId]. We add a pluggable prioritizer that can be 
> used to prioritize this list of peers based on topology information. Peers 
> with higher priority occur first in the sequence and the BlockManager tries 
> to replicate blocks in that order.
> There would be default implementations for these pluggable interfaces that 
> replicate the existing behavior of randomly choosing a peer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285539#comment-15285539
 ] 

Sean Owen commented on SPARK-3785:
--

That, and things like YARN labels, are indeed a pre-requisite to being able to 
target work at machines with a GPU. Those are already done. But this is about 
doing something in Spark to off-load work to a GPU. That doesn't actually 
require any further support from Spark; it already works.

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to add support for off-loading computations to the 
> GPU, e.g. via an OpenCL binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15353) Making peer selection for block replication pluggable

2016-05-16 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-15353:
--

 Summary: Making peer selection for block replication pluggable
 Key: SPARK-15353
 URL: https://issues.apache.org/jira/browse/SPARK-15353
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager, Spark Core
Reporter: Shubham Chopra


BlockManagers running on executors provide all logistics around block 
management. Before a BlockManager can be used, it has to be “initialized”. As a 
part of the initialization, BlockManager asks the BlockManagerMasterEndpoint to 
give it topology information. The BlockManagerMasterEndpoint is provided a 
pluggable interface that can be used to resolve a hostname to topology. This 
information is used to decorate the BlockManagerId. This happens at cluster 
start and whenever a new executor is added.
During replication, the BlockManager gets the list of all its peers in the form 
of a Seq[BlockManagerId]. We add a pluggable prioritizer that can be used to 
prioritize this list of peers based on topology information. Peers with higher 
priority occur first in the sequence and the BlockManager tries to replicate 
blocks in that order.
There would be default implementations for these pluggable interfaces that 
replicate the existing behavior of randomly choosing a peer.
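A REPL-style sketch of what the two pluggable pieces could look like; these traits and names are illustrative assumptions, not the final Spark interfaces, and {{PeerId}} is a simplified stand-in for BlockManagerId.
{code}
// Simplified stand-in for Spark's BlockManagerId, decorated with optional topology info.
case class PeerId(executorId: String, host: String, topology: Option[String])

// Driver-side plugin: resolve a hostname to topology (e.g. a rack) at cluster
// start and whenever a new executor registers.
trait TopologyResolver {
  def topologyFor(host: String): Option[String]
}

// Plugin that reorders the peers; higher-priority peers come first and the
// BlockManager tries to replicate in that order.
trait PeerPrioritizer {
  def prioritize(self: PeerId, peers: Seq[PeerId]): Seq[PeerId]
}

// Default behavior described above: random choice, matching today's replication.
object RandomPrioritizer extends PeerPrioritizer {
  def prioritize(self: PeerId, peers: Seq[PeerId]): Seq[PeerId] =
    scala.util.Random.shuffle(peers)
}
{code}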



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15352) Topology aware block replication

2016-05-16 Thread Shubham Chopra (JIRA)
Shubham Chopra created SPARK-15352:
--

 Summary: Topology aware block replication
 Key: SPARK-15352
 URL: https://issues.apache.org/jira/browse/SPARK-15352
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager, Mesos, Spark Core, YARN
Reporter: Shubham Chopra


With cached RDDs, Spark can be used for online analytics, responding to online 
queries. But the loss of RDD partitions due to node/executor failures can cause 
huge delays in such use cases, as the data would have to be regenerated.
Cached RDDs, even when using multiple replicas per block, are not currently 
resilient to node failures when multiple executors are started on the same 
node. Block replication currently chooses a peer at random, and this peer could 
also exist on the same host. 
This effort would add topology aware replication to Spark, enabled through 
pluggable strategies. For ease of development/review, this is being broken down 
into three major work-efforts:
1.	Making peer selection for replication pluggable
2.	Providing pluggable implementations for topology resolution and topology 
aware replication
3.	Pro-active replenishment of lost blocks




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15100) Audit: ml.feature

2016-05-16 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285366#comment-15285366
 ] 

Bryan Cutler commented on SPARK-15100:
--

I can do a PR to update CountVectorizer and HashingTF

> Audit: ml.feature
> -
>
> Key: SPARK-15100
> URL: https://issues.apache.org/jira/browse/SPARK-15100
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Audit this sub-package for new algorithms which do not have corresponding 
> sections & examples in the user guide.
> See parent issue for more details.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285321#comment-15285321
 ] 

Barry Becker commented on SPARK-15230:
--

I updated the description so it says distinct instead of describe. I believe 
there is a separate jira for the problem with describe not handling backquoted 
columns.

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for distinct().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}
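A hedged workaround sketch (not a fix for the underlying resolution bug): alias the backquoted column to a dot-free name before calling distinct(), so the duplicate-elimination step never has to resolve a name containing a dot. {{testDf}} is the DataFrame from the description; the alias name is arbitrary.
{code}
import org.apache.spark.sql.functions.col

// Rename `pos.NoZero` to a dot-free alias first, then deduplicate.
val renamed = testDf.select(col("`pos.NoZero`").alias("pos_NoZero"))
renamed.distinct().collect().mkString(", ")
{code}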



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Barry Becker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Barry Becker updated SPARK-15230:
-
Description: 
When working with a dataframe, columns with .'s in them must be backquoted (``) 
or the column name will not be found. This works for most dataframe methods, 
but I discovered that it does not work for distinct().

Suppose you have a dataFrame, testDf, with a DoubleType column named 
{{pos.NoZero}}.  This statement:
{noformat}
testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
{noformat}
will fail with this error:
{noformat}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "pos.NoZero" 
among (pos.NoZero);

at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
at 
com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
{noformat}


  was:
When working with a dataframe, columns with .'s in them must be backquoted (``) 
or the column name will not be found. This works for most dataframe methods, 
but I discovered that it does not work for describe().

Suppose you have a dataFrame, testDf, with a DoubleType column named 
{{pos.NoZero}}.  This statement:
{noformat}
testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
{noformat}
will fail with this error:
{noformat}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "pos.NoZero" 
among (pos.NoZero);

at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at 
org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
at 
org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
at 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
at 
com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
{noformat}



> Back quoted column with dot in it fails when running distinct on dataframe
> 

[jira] [Comment Edited] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285302#comment-15285302
 ] 

Bo Meng edited comment on SPARK-15230 at 5/16/16 9:11 PM:
--

In the description, {{it does not work for describe()}} should be {{it does not 
work for distinct()}}, please update the description, thanks.


was (Author: bomeng):
In the description, `it does not work for describe()` should be `it does not 
work for distinct()`, please update the description, thanks.

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for describe().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Bo Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285302#comment-15285302
 ] 

Bo Meng commented on SPARK-15230:
-

In the description, `it does not work for describe()` should be `it does not 
work for distinct()`, please update the description, thanks.

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for describe().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15230:


Assignee: (was: Apache Spark)

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for describe().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285263#comment-15285263
 ] 

Apache Spark commented on SPARK-15230:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13140

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for describe().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15230:


Assignee: Apache Spark

> Back quoted column with dot in it fails when running distinct on dataframe
> --
>
> Key: SPARK-15230
> URL: https://issues.apache.org/jira/browse/SPARK-15230
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 1.6.0
>Reporter: Barry Becker
>Assignee: Apache Spark
>
> When working with a dataframe, columns with .'s in them must be backquoted 
> (``) or the column name will not be found. This works for most dataframe 
> methods, but I discovered that it does not work for describe().
> Suppose you have a dataFrame, testDf, with a DoubleType column named 
> {{pos.NoZero}}.  This statement:
> {noformat}
> testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ")
> {noformat}
> will fail with this error:
> {noformat}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name 
> "pos.NoZero" among (pos.NoZero);
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348)
>   at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319)
>   at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612)
>   at 
> com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-05-16 Thread Bill Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285248#comment-15285248
 ] 

Bill Zhao commented on SPARK-3785:
--

Mesos has added GPU support in the 0.29 release: 
https://issues.apache.org/jira/browse/MESOS-4424  If Spark can use GPUs as a 
resource from Mesos, it will expedite GPU computation for Spark.

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to add support for off-loading computations to the 
> GPU, e.g. via an OpenCL binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14942) Reduce delay between batch construction and execution

2016-05-16 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14942.
--
   Resolution: Fixed
 Assignee: Liwei Lin
Fix Version/s: 2.0.0

> Reduce delay between batch construction and execution
> -
>
> Key: SPARK-14942
> URL: https://issues.apache.org/jira/browse/SPARK-14942
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liwei Lin
>Assignee: Liwei Lin
> Fix For: 2.0.0
>
>
> Currently in {{StreamExecution}}, we first run the batch, then construct the 
> next:
> {code}
> if (dataAvailable) runBatch()
> constructNextBatch()
> {code}
> This is good if we run batches ASAP, where data would get processed in the 
> very next batch:
> !https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png!
> However, if we run batches at a trigger like {{ProcessingTime("1 minute")}}, data 
> - such as y below - may not get processed in the very next batch, i.e. batch 
> 1, but in batch 2:
> !https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png!
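A self-contained, hypothetical model of that loop (every name below is an illustrative stub, not StreamExecution's real internals), only to show why data arriving while {{runBatch()}} executes waits a full trigger interval when the trigger is long:
{code}
object TriggerLoopSketch extends App {
  def dataAvailable: Boolean = true
  def runBatch(): Unit = println("run batch over offsets fixed at construction time")
  def constructNextBatch(): Unit = println("construct next batch (captures newly arrived data)")

  // Stand-in for the per-trigger callback with a long trigger such as
  // ProcessingTime("1 minute"): data captured by constructNextBatch() is not
  // processed until the body runs again, i.e. one full trigger interval later.
  for (_ <- 1 to 2) {
    if (dataAvailable) runBatch()
    constructNextBatch()
    Thread.sleep(60 * 1000L)
  }
}
{code}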



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15072) Remove SparkSession.withHiveSupport

2016-05-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285112#comment-15285112
 ] 

Nicholas Chammas commented on SPARK-15072:
--

Brief note from [~yhuai] on the motivation behind this issue: 
https://github.com/apache/spark/pull/13069#issuecomment-219516577

> Remove SparkSession.withHiveSupport
> ---
>
> Key: SPARK-15072
> URL: https://issues.apache.org/jira/browse/SPARK-15072
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Sandeep Singh
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15186) Add user guide for Generalized Linear Regression.

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15186:


Assignee: Seth Hendrickson  (was: Apache Spark)

> Add user guide for Generalized Linear Regression.
> -
>
> Key: SPARK-15186
> URL: https://issues.apache.org/jira/browse/SPARK-15186
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> We should add a user guide for the new GLR interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15186) Add user guide for Generalized Linear Regression.

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15186:


Assignee: Apache Spark  (was: Seth Hendrickson)

> Add user guide for Generalized Linear Regression.
> -
>
> Key: SPARK-15186
> URL: https://issues.apache.org/jira/browse/SPARK-15186
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>Priority: Minor
>
> We should add a user guide for the new GLR interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15186) Add user guide for Generalized Linear Regression.

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285034#comment-15285034
 ] 

Apache Spark commented on SPARK-15186:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/13139

> Add user guide for Generalized Linear Regression.
> -
>
> Key: SPARK-15186
> URL: https://issues.apache.org/jira/browse/SPARK-15186
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Minor
>
> We should add a user guide for the new GLR interface.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15343.

Resolution: Not A Problem

Closing as "not a problem" since this is an issue with 3rd-party code.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284981#comment-15284981
 ] 

Marcelo Vanzin commented on SPARK-15343:


bq. at 
org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)

You're using a 3rd-party module developed by Hortonworks to talk to the YARN 
ATS; they include it as part of their distribution, but I believe it's not yet 
compatible with Spark 2.0. So you need to follow up with them, since this is 
not an issue with Spark, or disable that feature.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 

[jira] [Assigned] (SPARK-15351) RowEncoder should support array as the external type for ArrayType

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15351:


Assignee: Apache Spark  (was: Wenchen Fan)

> RowEncoder should support array as the external type for ArrayType
> --
>
> Key: SPARK-15351
> URL: https://issues.apache.org/jira/browse/SPARK-15351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15351) RowEncoder should support array as the external type for ArrayType

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284913#comment-15284913
 ] 

Apache Spark commented on SPARK-15351:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/13138

> RowEncoder should support array as the external type for ArrayType
> --
>
> Key: SPARK-15351
> URL: https://issues.apache.org/jira/browse/SPARK-15351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15351) RowEncoder should support array as the external type for ArrayType

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15351:


Assignee: Wenchen Fan  (was: Apache Spark)

> RowEncoder should support array as the external type for ArrayType
> --
>
> Key: SPARK-15351
> URL: https://issues.apache.org/jira/browse/SPARK-15351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15347) Problem select empty ORC table

2016-05-16 Thread Pedro Prado (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284907#comment-15284907
 ] 

Pedro Prado commented on SPARK-15347:
-

Sorry Sean! my fault!

> Problem select empty ORC table
> --
>
> Key: SPARK-15347
> URL: https://issues.apache.org/jira/browse/SPARK-15347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: Hadoop 2.7.1.2.4.2.0-258
> Subversion g...@github.com:hortonworks/hadoop.git -r 
> 13debf893a605e8a88df18a7d8d214f571e05289
> Compiled by jenkins on 2016-04-25T05:46Z
> Compiled with protoc 2.5.0
> From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac
> This command was run using 
> /usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar
>Reporter: Pedro Prado
>
> Error when I selected an empty ORC table
> [pprado@hadoop-m ~]$ beeline -u jdbc:hive2://
> WARNING: Use "yarn jar" to launch YARN applications.
> Connecting to jdbc:hive2://
> Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
> Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive
> On beeline => create table my_test (id int, name String) stored as orc;
> On beeline => select * from my_test;
> 16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri 
> with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
> OK
> +--------------+----------------+--+
> | my_test.id   | my_test.name   |
> +--------------+----------------+--+
> +--------------+----------------+--+
> No rows selected (1.227 seconds)
> Hive is OK!
> Now, when I execute pyspark:
> Welcome to
> SPARK version 1.6.1
> Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
> SparkContext available as sc, HiveContext available as sqlContext.
> PySpark => sqlContext.sql("select * from my_test")
> 16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test
> 16/05/13 18:33:41 INFO ParseDriver: Parse Completed
> Traceback (most recent call last):
> File "", line 1, in
> File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line 
> 580, in sql
> return DataFrame(self.ssql_ctx.sql(sqlQuery), self)
> File 
> "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
>  line 813, in __call_
> File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path 
> hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not 
> have valid orc files matching the pattern'
> When I create a parquet table, it's all right; I do not have the problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15351) RowEncoder should support array as the external type for ArrayType

2016-05-16 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-15351:
---

 Summary: RowEncoder should support array as the external type for 
ArrayType
 Key: SPARK-15351
 URL: https://issues.apache.org/jira/browse/SPARK-15351
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation

2016-05-16 Thread Lubomir Nerad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284776#comment-15284776
 ] 

Lubomir Nerad commented on SPARK-15272:
---

We can work around the Kafka part of the issue. But what about the delay 
scheduling algorithm? Can't the same problem arise if, for example, some host 
dies after a TaskSet has been constructed with tasks having it in their 
preferred locations?

> DirectKafkaInputDStream doesn't work with window operation
> --
>
> Key: SPARK-15272
> URL: https://issues.apache.org/jira/browse/SPARK-15272
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Lubomir Nerad
>
> Using Kafka direct {{DStream}} with a simple window operation like:
> {code:java}
> kafkaDStream.window(Durations.milliseconds(1),
>     Durations.milliseconds(1000))
>   .print();
> {code}
> with 1s batch duration either freezes after several seconds or lags terribly 
> (depending on cluster mode).
> This happens when Kafka brokers are not part of the Spark cluster (they are 
> on different nodes). The {{KafkaRDD}} still reports them as preferred 
> locations. This doesn't seem to be a problem in non-window scenarios, but with 
> a window it conflicts with the delay scheduling algorithm implemented in 
> {{TaskSetManager}}. It either significantly delays (YARN mode) or completely 
> drains (Spark mode) resource offers with {{TaskLocality.ANY}}, which are 
> needed to process tasks with these Kafka-broker-aligned preferred locations. 
> When the delay scheduling algorithm is switched off ({{spark.locality.wait=0}}), 
> the example works correctly.
> I think that the {{KafkaRDD}} shouldn't report preferred locations if the 
> brokers don't correspond to worker nodes, or it should allow the reporting of 
> preferred locations to be switched off. Also, it would be good if the delay 
> scheduling algorithm didn't drain/delay offers in the case where the tasks 
> have unmatched preferred locations.
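A minimal Scala sketch of the {{spark.locality.wait=0}} workaround mentioned in the description (the app name is illustrative); note that this disables the locality wait globally, trading data locality for liveness:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Durations, StreamingContext}

// Switch off delay scheduling so offers with TaskLocality.ANY are not held
// back waiting for the unreachable Kafka-broker "preferred" locations.
val conf = new SparkConf()
  .setAppName("kafka-window-workaround")
  .set("spark.locality.wait", "0")

val ssc = new StreamingContext(conf, Durations.seconds(1)) // 1s batches, as in the report
{code}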



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15247:


Assignee: (was: Apache Spark)

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though this is only 1 very small file
> This issue can increase the latency for small jobs.
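
For reference, a hedged Scala sketch of how the reported behaviour can be observed (assuming a SQLContext named sqlCtx and a placeholder path to a single small Parquet file):

{code:scala}
// Read one tiny Parquet file and inspect how many partitions (and thus tasks)
// the scan produces.
val df = sqlCtx.read.parquet("/tmp/one-small-file.parquet")  // placeholder path
println(s"partitions = ${df.rdd.partitions.length}")
// Reported: at least numExecutors * coresPerExecutor, even for a single small file.
{code}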



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284771#comment-15284771
 ] 

Apache Spark commented on SPARK-15247:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13137

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though this is only 1 very small file
> This issue can increase the latency for small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15247:


Assignee: Apache Spark

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>Assignee: Apache Spark
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though this is only 1 very small file
> This issue can increase the latency for small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284763#comment-15284763
 ] 

Takeshi Yamamuro commented on SPARK-15247:
--

I'll make a PR to fix this.

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though this is only 1 very small file
> This issue can increase the latency for small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284486#comment-15284486
 ] 

Takeshi Yamamuro edited comment on SPARK-15247 at 5/16/16 3:56 PM:
---

Not yet. Actually, I'm not 100% sure that this issue needs to be fixed.


was (Author: maropu):
Not yet. Actually, I'm not sure that this issue needs to be fixed.

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though this is only 1 very small file
> This issue can increase the latency for small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

2016-05-16 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-15350:
---
Priority: Minor  (was: Major)

> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite
> -
>
> Key: SPARK-15350
> URL: https://issues.apache.org/jira/browse/SPARK-15350
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15350:


Assignee: Apache Spark

> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite
> -
>
> Key: SPARK-15350
> URL: https://issues.apache.org/jira/browse/SPARK-15350
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284718#comment-15284718
 ] 

Apache Spark commented on SPARK-15350:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/13136

> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite
> -
>
> Key: SPARK-15350
> URL: https://issues.apache.org/jira/browse/SPARK-15350
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15350:


Assignee: (was: Apache Spark)

> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite
> -
>
> Key: SPARK-15350
> URL: https://issues.apache.org/jira/browse/SPARK-15350
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Add unit test function for LogisticRegressionWithLBFGS in 
> JavaLogisticRegressionSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite

2016-05-16 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-15350:
--

 Summary: Add unit test function for LogisticRegressionWithLBFGS in 
JavaLogisticRegressionSuite
 Key: SPARK-15350
 URL: https://issues.apache.org/jira/browse/SPARK-15350
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Weichen Xu


Add unit test function for LogisticRegressionWithLBFGS in 
JavaLogisticRegressionSuite.
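
For context, a hedged Scala sketch of the call path such a test would exercise (the Java suite would go through the equivalent Java API; the data and assertion here are illustrative only):

{code:scala}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

def trainAndCheck(sc: SparkContext): Unit = {
  // Two trivially separable training points.
  val data = sc.parallelize(Seq(
    LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
    LabeledPoint(1.0, Vectors.dense(1.0, 0.0))))
  val model = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(data)
  // The model should classify its own training points correctly.
  assert(model.predict(Vectors.dense(1.0, 0.0)) == 1.0)
}
{code}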



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15348) Hive ACID

2016-05-16 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284698#comment-15284698
 ] 

Ran Haim edited comment on SPARK-15348 at 5/16/16 3:09 PM:
---

This means that if I have a transactional table in Hive, I cannot use a Spark 
job to update it or even read it in a coherent way.


was (Author: ran.h...@optimalplus.com):
If I have a transnational table in hive, I cannot use spark job to update it or 
even read it in a coherent way.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ran Haim
>
> Spark does not support any of the features of Hive's transactional (ACID) tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction has been done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15348) Hive ACID

2016-05-16 Thread Ran Haim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284698#comment-15284698
 ] 

Ran Haim commented on SPARK-15348:
--

If I have a transnational table in hive, I cannot use spark job to update it or 
even read it in a coherent way.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ran Haim
>
> Spark does not support any of the features of Hive's transactional (ACID) tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction has been done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15349) Hive ACID

2016-05-16 Thread Ran Haim (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ran Haim closed SPARK-15349.

Resolution: Duplicate

> Hive ACID
> -
>
> Key: SPARK-15349
> URL: https://issues.apache.org/jira/browse/SPARK-15349
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ran Haim
>
> Spark does not support any of the features of Hive's transactional (ACID) tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction has been done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15347) Problem select empty ORC table

2016-05-16 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-15347.
---
   Resolution: Duplicate
Fix Version/s: (was: 1.6.0)

Please have a look through JIRA first and read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> Problem select empty ORC table
> --
>
> Key: SPARK-15347
> URL: https://issues.apache.org/jira/browse/SPARK-15347
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.1
> Environment: Hadoop 2.7.1.2.4.2.0-258
> Subversion g...@github.com:hortonworks/hadoop.git -r 
> 13debf893a605e8a88df18a7d8d214f571e05289
> Compiled by jenkins on 2016-04-25T05:46Z
> Compiled with protoc 2.5.0
> From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac
> This command was run using 
> /usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar
>Reporter: Pedro Prado
>
> Error when I select from an empty ORC table:
> [pprado@hadoop-m ~]$ beeline -u jdbc:hive2://
> WARNING: Use "yarn jar" to launch YARN applications.
> Connecting to jdbc:hive2://
> Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
> Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive
> On beeline => create table my_test (id int, name String) stored as orc;
> On beeline => select * from my_test;
> 16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri 
> with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
> OK
> +--------------+----------------+--+
> | my_test.id   | my_test.name   |
> +--------------+----------------+--+
> +--------------+----------------+--+
> No rows selected (1.227 seconds)
> Hive is OK!
> Now, when I execute pyspark:
> Welcome to
> SPARK version 1.6.1
> Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
> SparkContext available as sc, HiveContext available as sqlContext.
> PySpark => sqlContext.sql("select * from my_test")
> 16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test
> 16/05/13 18:33:41 INFO ParseDriver: Parse Completed
> Traceback (most recent call last):
> File "", line 1, in
> File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line 
> 580, in sql
> return DataFrame(self.ssql_ctx.sql(sqlQuery), self)
> File 
> "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
> line 813, in __call__
> File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path 
> hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not 
> have valid orc files matching the pattern'
> When I create a parquet table, everything works fine; I do not have this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15348) Hive ACID

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284688#comment-15284688
 ] 

Sean Owen commented on SPARK-15348:
---

I suspect that's way outside the goals of the project, and a huge piece of work.

> Hive ACID
> -
>
> Key: SPARK-15348
> URL: https://issues.apache.org/jira/browse/SPARK-15348
> Project: Spark
>  Issue Type: New Feature
>Reporter: Ran Haim
>
> Spark does not support any of the features of Hive's transactional (ACID) tables:
> you cannot use Spark to delete/update a table, and it also has problems 
> reading the aggregated data when no compaction has been done.
> Also it seems that compaction is not supported - alter table ... partition 
>  COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15349) Hive ACID

2016-05-16 Thread Ran Haim (JIRA)
Ran Haim created SPARK-15349:


 Summary: Hive ACID
 Key: SPARK-15349
 URL: https://issues.apache.org/jira/browse/SPARK-15349
 Project: Spark
  Issue Type: New Feature
Reporter: Ran Haim


Spark does not support any of the features of Hive's transactional (ACID) tables:
you cannot use Spark to delete/update a table, and it also has problems reading 
the aggregated data when no compaction has been done.
Also it seems that compaction is not supported - alter table ... partition  
COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15348) Hive ACID

2016-05-16 Thread Ran Haim (JIRA)
Ran Haim created SPARK-15348:


 Summary: Hive ACID
 Key: SPARK-15348
 URL: https://issues.apache.org/jira/browse/SPARK-15348
 Project: Spark
  Issue Type: New Feature
Reporter: Ran Haim


Spark does not support any of the features of Hive's transactional (ACID) tables:
you cannot use Spark to delete/update a table, and it also has problems reading 
the aggregated data when no compaction has been done.
Also it seems that compaction is not supported - alter table ... partition  
COMPACT 'major'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15347) Problem select empty ORC table

2016-05-16 Thread Pedro Prado (JIRA)
Pedro Prado created SPARK-15347:
---

 Summary: Problem select empty ORC table
 Key: SPARK-15347
 URL: https://issues.apache.org/jira/browse/SPARK-15347
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.1
 Environment: Hadoop 2.7.1.2.4.2.0-258
Subversion g...@github.com:hortonworks/hadoop.git -r 
13debf893a605e8a88df18a7d8d214f571e05289
Compiled by jenkins on 2016-04-25T05:46Z
Compiled with protoc 2.5.0
From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac
This command was run using 
/usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar

Reporter: Pedro Prado
 Fix For: 1.6.0



Error when I select from an empty ORC table:

[pprado@hadoop-m ~]$ beeline -u jdbc:hive2://
WARNING: Use "yarn jar" to launch YARN applications.
Connecting to jdbc:hive2://
Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258)
Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive

On beeline => create table my_test (id int, name String) stored as orc;
On beeline => select * from my_test;

16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri 
with key [dfs.encryption.key.provider.uri] to create a keyProvider !!
OK
+--------------+----------------+--+
| my_test.id   | my_test.name   |
+--------------+----------------+--+
+--------------+----------------+--+
No rows selected (1.227 seconds)

Hive is OK!

Now, when I execute pyspark:

Welcome to
SPARK version 1.6.1

Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56)
SparkContext available as sc, HiveContext available as sqlContext.

PySpark => sqlContext.sql("select * from my_test")

16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test
16/05/13 18:33:41 INFO ParseDriver: Parse Completed
Traceback (most recent call last):
File "", line 1, in
File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line 580, 
in sql
return DataFrame(self.ssql_ctx.sql(sqlQuery), self)
File 
"/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", 
line 813, in __call__
File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, in 
deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path 
hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not 
have valid orc files matching the pattern'

When I create a parquet table, everything works fine; I do not have this problem.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans

2016-05-16 Thread Abraham Zhan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abraham Zhan updated SPARK-15346:
-
Description: 
h2.Main Issue
I found that for KMeans|| in MLlib, when the dataset is large, after the 
initial KMeans|| step finishes and before Lloyd's iterations begin, the program 
gets stuck for a long time without terminating. After testing I found it is 
stuck in LocalKMeans, and there is room for improvement in LocalKMeans.scala in 
MLlib: after picking each new initial center, it is unnecessary to compute the 
distances between all the points and the previous centers, as below
{code:scala}
val costArray = points.map { point =>
  KMeans.fastSquaredDistance(point, centers(0))
}
{code}

Instead, we can keep the distance between each point and its closest center, 
compare it with the distance to the newly picked center, and update it if the 
new center is closer.

h2.Test
Download 
[LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
I provided an attachment "LocalKMeans.zip" which contains the code 
"LocalKMeans.scala" and the dataset "bigKMeansMedia". 
LocalKMeans.scala contains both the original method KMeansPlusPlus and a 
modified version KMeansPlusPlusModify (best fit with spark.mllib-1.6.0).
I added tests and a main function to it so that anyone can run the file 
directly.

h3.How to Test
Replace mllib.clustering.LocalKMeans.scala in your local repository with my 
LocalKMeans.scala. 
Modify the path in line 34 (loadAndRun()) to the path where you stored the data 
file bigKMeansMedia, which is also provided in the patch. 
Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the 
clustering number K and the iteration number respectively. 
The console will then print the running time and SE of the two versions of 
KMeans++.

h2.Test Results

This data was generated from a KMeans|| experiment in Spark; I added some inner 
functions to output and save the result of the KMeans|| initialization.
The first line of the file, with format "%d:%d:%d:%d", indicates "the 
seed:feature num:iteration num (in the original KMeans||):points num" of the data. 

On my machine the experiment results are as below:

!https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
 (the x-axis is the clustering number k and the y-axis is the time in seconds)
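
For clarity, a hedged, self-contained Scala sketch of the proposed bookkeeping (names are illustrative and not the actual LocalKMeans internals): each point's distance to its closest chosen center is kept in a cost array and only compared against the newly picked center, instead of being recomputed against all previous centers.

{code:scala}
import scala.util.Random

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def pickInitialCenters(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
  val rand = new Random(seed)
  val centers = new Array[Array[Double]](k)
  centers(0) = points(rand.nextInt(points.length))
  // costArray(i) = squared distance from points(i) to its closest center chosen so far.
  val costArray = points.map(p => squaredDistance(p, centers(0)))
  for (c <- 1 until k) {
    // Sample the next center with probability proportional to the current costs
    // (the usual k-means++ rule).
    val r = rand.nextDouble() * costArray.sum
    var cumulative = 0.0
    var j = 0
    while (j < points.length - 1 && cumulative + costArray(j) < r) {
      cumulative += costArray(j)
      j += 1
    }
    centers(c) = points(j)
    // Update step: a single pass against the new center only, instead of
    // recomputing distances to all previously chosen centers.
    var i = 0
    while (i < points.length) {
      costArray(i) = math.min(costArray(i), squaredDistance(points(i), centers(c)))
      i += 1
    }
  }
  centers
}
{code}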

  was:
h2.Main Issue
I found the actually reason why GUI does not finish, which turns out that it's 
stuck with LocalKMeans. And there is a to be improved feature in 
LocalKMeans.scala in Mllib. After picking each new initial centers, it's 
unnecessary to compute the distances between all the points and the old centers 
as below
{code:scala}
val costArray = points.map { point =>
  KMeans.fastSquaredDistance(point, centers(0))
}
{code}

Instead this we can keep the distance between all the points and their closest 
centers, and compare to the distance of them with the new center then update 
them.

h2.Test
Download 
[LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
I provided a attach "LocalKMeans.zip" which contains the code 
"LocalKMeans.scala" and dataset "bigKMeansMedia" 
LocalKMeans.scala contains both original version method KMeansPlusPlus and a 
modified version KMeansPlusPlusModify. (best fit with spark.mllib-1.6.0)
I added a tests and main function in it so that any one can run the file 
directly.

h3.How to Test
Replacing mllib.clustering.LocalKMeans.scala in your local repository with my 
LocalKMeans.scala. 
Modify the path in line 34 (loadAndRun()) with the path you restoring the data 
file bigKMeansMedia which is also provided in the patch. 
Tune the 2nd and 3rd parameter in line 34 (loadAndRun()) which are refereed to 
clustering number K and iteration number respectively. 
Then the console will print the cost time and SE of the two version of KMeans++ 
respectively.

h2.Test Results

This data is generated from a KMeans|| eperiment in spark, I add some inner 
function and output the result of KMeans|| initialization and restore.
The first line of the file with format "%d:%d:%d:%d" indicates "the 
seed:feature num:iteration num (in original KMeans||):points num" of the data. 

In my machine the experiment result is as below:

!https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
 the x-axis is the clustering num k while y-axis is the time in seconds


> Reduce duplicate computation in picking initial points in LocalKMeans
> -
>
> Key: SPARK-15346
> URL: https://issues.apache.org/jira/browse/SPARK-15346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
> Environment: Ubuntu 14.04
>Reporter: Abraham Zhan
>  Labels: performance
>
> h2.Main Issue
> I 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284623#comment-15284623
 ] 

Sean Owen commented on SPARK-15343:
---

Since you're executing in a cluster, I think perhaps a better and more 
canonical solution is to build with "-Phadoop-provided" and get the Hadoop 
dependencies from the cluster? Then you're inheriting the version that's 
consistent with the cluster config.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at 

[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284619#comment-15284619
 ] 

Maciej Bryński edited comment on SPARK-15343 at 5/16/16 2:05 PM:
-

Thanks.

I set spark.hadoop.yarn.timeline-service.enabled to false.
It's a nasty workaround, but it works.
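
For reference, a minimal Scala sketch of that workaround (spark.hadoop.* keys are forwarded into the Hadoop Configuration, so the YARN timeline client, which is the component that needs Jersey 1 here, is never instantiated); the same key can equally go into spark-defaults.conf:

{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.hadoop.yarn.timeline-service.enabled", "false")
{code}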


was (Author: maver1ck):
I set spark.hadoop.yarn.timeline-service.enabled to false.
It's nasty workaround but it works.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284619#comment-15284619
 ] 

Maciej Bryński commented on SPARK-15343:


I set spark.hadoop.yarn.timeline-service.enabled to false.
It's a nasty workaround, but it works.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at 

[jira] [Issue Comment Deleted] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2016-05-16 Thread Stephen Boesch (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephen Boesch updated SPARK-4924:
--
Comment: was deleted

(was: Chiming in here as well: three of us are now asking for commentary / 
pointers to the following:

*  What capabilities have been added to the spark api
* How do we use them
* Any examples / other relevant documentation and/or code

Just saying "read the documentation" is not acceptable guidance for how to use 
these added features.)

> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable,  but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.
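
For reference, a hedged sketch of using the resulting launcher library ({{org.apache.spark.launcher.SparkLauncher}}); the jar path, main class, and settings are placeholders:

{code:scala}
import org.apache.spark.launcher.SparkLauncher

object LaunchExample {
  def main(args: Array[String]): Unit = {
    // Start a Spark application as a child process, without invoking
    // spark-submit by hand.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder
      .setMainClass("com.example.MyApp")       // placeholder
      .setMaster("yarn")
      .setDeployMode("cluster")
      .setConf("spark.executor.memory", "2g")
      .launch()
    process.waitFor()  // block until the application finishes
  }
}
{code}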



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284617#comment-15284617
 ] 

Sean Owen commented on SPARK-12154:
---

No, I don't think so - let's keep the discussion in one place on the other JIRA

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a break for users who were using Jersey 1 in 
> their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284612#comment-15284612
 ] 

Sean Owen commented on SPARK-15343:
---

Yes, of course that's the change that caused the behavior you're seeing, but it 
should be OK for all of Spark's usages. At least, that was the conclusion 
before, and all of the Spark tests work.

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284538#comment-15284538
 ] 

Sean Owen commented on SPARK-15343:
---

No, it's clearly a class needed by YARN and that's where it fails -- have a 
look at the stack. Yes, YARN certainly is the one using Jersey 1.x and it is in 
a different namespace. When this came up before I was wondering if we needed to 
adjust exclusions to allow both into the assembly, but have a look at this: 
http://apache-spark-developers-list.1001551.n3.nabble.com/spark-2-0-issue-with-yarn-td17440.html#a17448
  I think the conclusion was that the thing that needs Jersey isn't a part of 
Spark?

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at 

[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284509#comment-15284509
 ] 

Maciej Bryński commented on SPARK-12154:


I think this upgrade breaks compatibility with YARN.
https://issues.apache.org/jira/browse/SPARK-15343

> Upgrade to Jersey 2
> ---
>
> Key: SPARK-12154
> URL: https://issues.apache.org/jira/browse/SPARK-12154
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Spark Core
>Affects Versions: 1.5.2
>Reporter: Matt Cheah
>Assignee: Matt Cheah
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. 
> Library conflicts for Jersey are difficult to work around - see discussion on 
> SPARK-11081. It's easier to upgrade Jersey entirely, but we should target 
> Spark 2.0 since this may be a break for users who were using Jersey 1 in 
> their Spark jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284507#comment-15284507
 ] 

Maciej Bryński commented on SPARK-15343:


And the likely reason for the problem:
https://issues.apache.org/jira/browse/SPARK-12154

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting following error
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at 

[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284506#comment-15284506
 ] 

Maciej Bryński edited comment on SPARK-15343 at 5/16/16 1:52 PM:
-

I think it's too early for that. 
The exception is thrown during JavaSparkContext initialization, i.e. before any 
connection to YARN is made.

I checked jersey-client-1.19.1.jar and 
com/sun/jersey/api/client/config/ClientConfig is inside it.
Maybe we should include both versions of this library?


was (Author: maver1ck):
I think it's too early for that. 
Exception is thrown on JavaSparkContext initialization. So before connection to 
YARN.

I checked jersey-client-1.19.1.jar and 
com/sun/jersey/api/client/config/ClientConfig is inside.
Maybe we should include both versions ?

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284506#comment-15284506
 ] 

Maciej Bryński commented on SPARK-15343:


I think it's too early for that. 
The exception is thrown during JavaSparkContext initialization, i.e. before any 
connection to YARN is made.

I checked jersey-client-1.19.1.jar and 
com/sun/jersey/api/client/config/ClientConfig is inside it.
Maybe we should include both versions?
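
For what it's worth, a hypothetical sbt sketch of what "including both versions" 
could look like on the application side; the coordinates and versions below are 
assumptions, not something Spark itself ships:

{code:scala}
// Hypothetical sketch only: pull the Jersey 1 client in explicitly for the
// YARN timeline client, alongside whatever Jersey 2 artifacts Spark brings in.
libraryDependencies ++= Seq(
  "com.sun.jersey" % "jersey-client" % "1.19.1",
  "com.sun.jersey" % "jersey-core"   % "1.19.1"
)
{code}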

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at 

[jira] [Assigned] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15346:


Assignee: Apache Spark

> Reduce duplicate computation in picking initial points in LocalKMeans
> -
>
> Key: SPARK-15346
> URL: https://issues.apache.org/jira/browse/SPARK-15346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
> Environment: Ubuntu 14.04
>Reporter: Abraham Zhan
>Assignee: Apache Spark
>  Labels: performance
>
> h2.Main Issue
> I found the actual reason why the GUI does not finish: it turns out to be 
> stuck in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that 
> can be improved. After picking each new initial center, it is unnecessary to 
> compute the distances between all the points and the old centers, as below:
> {code:scala}
> val costArray = points.map { point =>
>   KMeans.fastSquaredDistance(point, centers(0))
> }
> {code}
> Instead, we can keep the distance between each point and its closest center, 
> compare it with the distance to the new center, and update it.
> h2.Test
> Download 
> [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
> I have attached "LocalKMeans.zip", which contains the code 
> "LocalKMeans.scala" and the dataset "bigKMeansMedia".
> LocalKMeans.scala contains both the original method, KMeansPlusPlus, and a 
> modified version, KMeansPlusPlusModify. (It fits best with spark.mllib-1.6.0.)
> I added tests and a main function to it so that anyone can run the file 
> directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my 
> LocalKMeans.scala. 
> Modify the path in line 34 (loadAndRun()) to the path where you stored the 
> data file bigKMeansMedia, which is also provided in the patch. 
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the 
> clustering number K and the iteration number respectively. 
> The console will then print the cost time and SE of the two versions of 
> KMeans++ respectively.
> h2.Test Results
> This data was generated from a KMeans|| experiment in Spark; I added some 
> inner functions to output the result of the KMeans|| initialization and 
> restore it.
> The first line of the file, with format "%d:%d:%d:%d", indicates "the 
> seed:feature num:iteration num (in original KMeans||):points num" of the data. 
> On my machine the experiment result is as below:
> !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
> the x-axis is the clustering number k while the y-axis is the time in seconds



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans

2016-05-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284498#comment-15284498
 ] 

Apache Spark commented on SPARK-15346:
--

User 'mouendless' has created a pull request for this issue:
https://github.com/apache/spark/pull/13133

> Reduce duplicate computation in picking initial points in LocalKMeans
> -
>
> Key: SPARK-15346
> URL: https://issues.apache.org/jira/browse/SPARK-15346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
> Environment: Ubuntu 14.04
>Reporter: Abraham Zhan
>  Labels: performance
>
> h2.Main Issue
> I found the actual reason why the GUI does not finish: it turns out to be 
> stuck in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that 
> can be improved. After picking each new initial center, it is unnecessary to 
> compute the distances between all the points and the old centers, as below:
> {code:scala}
> val costArray = points.map { point =>
>   KMeans.fastSquaredDistance(point, centers(0))
> }
> {code}
> Instead, we can keep the distance between each point and its closest center, 
> compare it with the distance to the new center, and update it.
> h2.Test
> Download 
> [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
> I have attached "LocalKMeans.zip", which contains the code 
> "LocalKMeans.scala" and the dataset "bigKMeansMedia".
> LocalKMeans.scala contains both the original method, KMeansPlusPlus, and a 
> modified version, KMeansPlusPlusModify. (It fits best with spark.mllib-1.6.0.)
> I added tests and a main function to it so that anyone can run the file 
> directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my 
> LocalKMeans.scala. 
> Modify the path in line 34 (loadAndRun()) to the path where you stored the 
> data file bigKMeansMedia, which is also provided in the patch. 
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the 
> clustering number K and the iteration number respectively. 
> The console will then print the cost time and SE of the two versions of 
> KMeans++ respectively.
> h2.Test Results
> This data was generated from a KMeans|| experiment in Spark; I added some 
> inner functions to output the result of the KMeans|| initialization and 
> restore it.
> The first line of the file, with format "%d:%d:%d:%d", indicates "the 
> seed:feature num:iteration num (in original KMeans||):points num" of the data. 
> On my machine the experiment result is as below:
> !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
> the x-axis is the clustering number k while the y-axis is the time in seconds



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans

2016-05-16 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15346:


Assignee: (was: Apache Spark)

> Reduce duplicate computation in picking initial points in LocalKMeans
> -
>
> Key: SPARK-15346
> URL: https://issues.apache.org/jira/browse/SPARK-15346
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
> Environment: Ubuntu 14.04
>Reporter: Abraham Zhan
>  Labels: performance
>
> h2.Main Issue
> I found the actual reason why the GUI does not finish: it turns out to be 
> stuck in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that 
> can be improved. After picking each new initial center, it is unnecessary to 
> compute the distances between all the points and the old centers, as below:
> {code:scala}
> val costArray = points.map { point =>
>   KMeans.fastSquaredDistance(point, centers(0))
> }
> {code}
> Instead, we can keep the distance between each point and its closest center, 
> compare it with the distance to the new center, and update it.
> h2.Test
> Download 
> [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
> I have attached "LocalKMeans.zip", which contains the code 
> "LocalKMeans.scala" and the dataset "bigKMeansMedia".
> LocalKMeans.scala contains both the original method, KMeansPlusPlus, and a 
> modified version, KMeansPlusPlusModify. (It fits best with spark.mllib-1.6.0.)
> I added tests and a main function to it so that anyone can run the file 
> directly.
> h3.How to Test
> Replace mllib.clustering.LocalKMeans.scala in your local repository with my 
> LocalKMeans.scala. 
> Modify the path in line 34 (loadAndRun()) to the path where you stored the 
> data file bigKMeansMedia, which is also provided in the patch. 
> Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the 
> clustering number K and the iteration number respectively. 
> The console will then print the cost time and SE of the two versions of 
> KMeans++ respectively.
> h2.Test Results
> This data was generated from a KMeans|| experiment in Spark; I added some 
> inner functions to output the result of the KMeans|| initialization and 
> restore it.
> The first line of the file, with format "%d:%d:%d:%d", indicates "the 
> seed:feature num:iteration num (in original KMeans||):points num" of the data. 
> On my machine the experiment result is as below:
> !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
> the x-axis is the clustering number k while the y-axis is the time in seconds



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans

2016-05-16 Thread Abraham Zhan (JIRA)
Abraham Zhan created SPARK-15346:


 Summary: Reduce duplicate computation in picking initial points in 
LocalKMeans
 Key: SPARK-15346
 URL: https://issues.apache.org/jira/browse/SPARK-15346
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
 Environment: Ubuntu 14.04
Reporter: Abraham Zhan


h2.Main Issue
I found the actual reason why the GUI does not finish: it turns out to be stuck 
in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that can be 
improved. After picking each new initial center, it is unnecessary to compute 
the distances between all the points and the old centers, as below:
{code:scala}
val costArray = points.map { point =>
  KMeans.fastSquaredDistance(point, centers(0))
}
{code}

Instead, we can keep the distance between each point and its closest center, 
compare it with the distance to the new center, and update it.
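
For illustration, here is a minimal, self-contained Scala sketch of that 
incremental update; plain arrays and a local squaredDistance stand in for 
MLlib's vector types and KMeans.fastSquaredDistance, so this is a sketch under 
those assumptions rather than the actual patch:

{code:scala}
import scala.util.Random

// Stand-in for KMeans.fastSquaredDistance (assumption: plain dense points).
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def pickInitialCenters(points: Array[Array[Double]], k: Int, rnd: Random): Array[Array[Double]] = {
  val centers = new Array[Array[Double]](k)
  centers(0) = points(rnd.nextInt(points.length))
  // costArray(i) = squared distance from points(i) to its closest center so far.
  val costArray = points.map(p => squaredDistance(p, centers(0)))
  for (j <- 1 until k) {
    // Standard k-means++ step: sample the next center with probability
    // proportional to the current costs.
    var r = rnd.nextDouble() * costArray.sum
    var idx = 0
    while (idx < points.length - 1 && r > costArray(idx)) {
      r -= costArray(idx)
      idx += 1
    }
    centers(j) = points(idx)
    // Incremental update: each point only needs to be compared with the newly
    // picked center, not recomputed against all previously picked centers.
    var i = 0
    while (i < points.length) {
      costArray(i) = math.min(costArray(i), squaredDistance(points(i), centers(j)))
      i += 1
    }
  }
  centers
}
{code}

Roughly, this turns the seeding pass from O(k^2 * n) distance computations into 
O(k * n).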

h2.Test
Download 
[LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip]
I have attached "LocalKMeans.zip", which contains the code "LocalKMeans.scala" 
and the dataset "bigKMeansMedia".
LocalKMeans.scala contains both the original method, KMeansPlusPlus, and a 
modified version, KMeansPlusPlusModify. (It fits best with spark.mllib-1.6.0.)
I added tests and a main function to it so that anyone can run the file 
directly.

h3.How to Test
Replace mllib.clustering.LocalKMeans.scala in your local repository with my 
LocalKMeans.scala. 
Modify the path in line 34 (loadAndRun()) to the path where you stored the data 
file bigKMeansMedia, which is also provided in the patch. 
Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the 
clustering number K and the iteration number respectively. 
The console will then print the cost time and SE of the two versions of KMeans++ 
respectively.

h2.Test Results

This data was generated from a KMeans|| experiment in Spark; I added some inner 
functions to output the result of the KMeans|| initialization and restore it.
The first line of the file, with format "%d:%d:%d:%d", indicates "the 
seed:feature num:iteration num (in original KMeans||):points num" of the data. 

On my machine the experiment result is as below:

!https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg!
 the x-axis is the clustering number k while the y-axis is the time in seconds



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284486#comment-15284486
 ] 

Takeshi Yamamuro commented on SPARK-15247:
--

Not yet. Actually, I'm not sure that this issue needs to be fixed.
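
For anyone who wants to check the behaviour locally, here is a minimal sketch 
(the parquet path is a placeholder, and sqlContext is assumed to be the shell's 
SQLContext):

{code:scala}
// Sketch: read one tiny parquet file and report how many partitions (and hence
// tasks for a simple action) the resulting DataFrame produces.
val df = sqlContext.read.parquet("/tmp/one-small-file.parquet")
println(s"partitions = ${df.rdd.getNumPartitions}")
{code}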

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though it is reading only one very small file.
> This issue can increase the latency of small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library

2016-05-16 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284483#comment-15284483
 ] 

Thomas Graves commented on SPARK-4924:
--

[~javadba] If you have ideas for improving the documentation, please file a JIRA 
and point them out or make suggestions.

The Java API is mentioned in the programming guide: 
http://spark.apache.org/docs/1.6.0/programming-guide.html#launching-spark-jobs-from-java--scala
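
For reference, a minimal sketch of launching an application through the launcher 
library (the paths, class name and master below are placeholders):

{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Sketch: SparkLauncher starts a separate spark-submit process for the app.
val process = new SparkLauncher()
  .setSparkHome("/opt/spark")              // placeholder
  .setAppResource("/path/to/my-app.jar")   // placeholder
  .setMainClass("com.example.MyApp")       // placeholder
  .setMaster("yarn-cluster")
  .launch()
process.waitFor()
{code}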



> Factor out code to launch Spark applications into a separate library
> 
>
> Key: SPARK-4924
> URL: https://issues.apache.org/jira/browse/SPARK-4924
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.4.0
>
> Attachments: spark-launcher.txt
>
>
> One of the questions we run into rather commonly is "how to start a Spark 
> application from my Java/Scala program?". There currently isn't a good answer 
> to that:
> - Instantiating SparkContext has limitations (e.g., you can only have one 
> active context at the moment, plus you lose the ability to submit apps in 
> cluster mode)
> - Calling SparkSubmit directly is doable but you lose a lot of the logic 
> handled by the shell scripts
> - Calling the shell script directly is doable, but sort of ugly from an API 
> point of view.
> I think it would be nice to have a small library that handles that for users. 
> On top of that, this library could be used by Spark itself to replace a lot 
> of the code in the current shell scripts, which have a lot of duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284480#comment-15284480
 ] 

Sean Owen commented on SPARK-15343:
---

Yeah, though in theory that doesn't prevent it from being pulled in by YARN 
from its own copy. You should have YARN 'provided' at runtime by the cluster -- 
not bundled in your app, right?
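
In sbt terms that usually looks something like the following sketch (the 
artifact versions are assumptions):

{code:scala}
// Sketch: Spark and Hadoop/YARN artifacts marked "provided" so the cluster
// supplies them at runtime instead of being bundled into the application jar.
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"         % "2.0.0" % "provided",
  "org.apache.hadoop" %  "hadoop-yarn-client" % "2.6.0" % "provided"
)
{code}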

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284465#comment-15284465
 ] 

Maciej Bryński commented on SPARK-15343:


[~srowen]
I found that we changed the version of the Jersey library from 1.9 
(https://github.com/apache/spark/blob/branch-1.6/pom.xml#L182) to 2.22.2 
(https://github.com/apache/spark/blob/master/pom.xml#L175).
Maybe that's the reason.
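
A quick way to confirm that is to check whether the Jersey 1 class the YARN 
timeline client needs is on the driver classpath at all; a small diagnostic 
sketch:

{code:scala}
// Diagnostic sketch: the YARN TimelineClient needs this Jersey 1 class; if it
// is missing, SparkContext construction fails with NoClassDefFoundError.
try {
  Class.forName("com.sun.jersey.api.client.config.ClientConfig")
  println("Jersey 1 ClientConfig found on the classpath")
} catch {
  case _: ClassNotFoundException => println("Jersey 1 ClientConfig is missing")
}
{code}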

> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at 

[jira] [Commented] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284454#comment-15284454
 ] 

Maciej Bryński commented on SPARK-14881:


[~felixcheung]
Could you check this?
https://issues.apache.org/jira/browse/SPARK-15344

> pyspark and sparkR shell default log level should match spark-shell/Scala
> -
>
> Key: SPARK-14881
> URL: https://issues.apache.org/jira/browse/SPARK-14881
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, SparkR
>Affects Versions: 2.0.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 2.0.0
>
>
> Scala spark-shell defaults to log level WARN; pyspark and sparkR should match 
> that by default (the user can change it later).
> # ./bin/spark-shell
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284449#comment-15284449
 ] 

Sean Owen commented on SPARK-15344:
---

I know, but I'm suggesting it's probably more useful to continue or reopen the 
original issue if it didn't work.

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set the default log level for PySpark.
> It's always WARN.
> The setting below doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the 
> spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, 
> so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}
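
As a side note, the log level can still be overridden at runtime once the 
context is up; a minimal Scala sketch of that workaround (it only overrides the 
level per application, it does not restore the log4j.properties default):

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the shells call setLogLevel("WARN") at startup; calling it again
// from application code overrides that for the current SparkContext.
val sc = new SparkContext(new SparkConf().setAppName("log-level-demo").setMaster("local[*]"))
sc.setLogLevel("INFO")
{code}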



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284442#comment-15284442
 ] 

Maciej Bryński edited comment on SPARK-15344 at 5/16/16 12:44 PM:
--

Yep.
I mentioned the PR from this JIRA in the description.


was (Author: maver1ck):
Yep.
I mention PR from this Jira in description.

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set the default log level for PySpark.
> It's always WARN.
> The setting below doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the 
> spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, 
> so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284442#comment-15284442
 ] 

Maciej Bryński commented on SPARK-15344:


Yep.
I mentioned the PR from this JIRA in the description.

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set the default log level for PySpark.
> It's always WARN.
> The setting below doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the 
> spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, 
> so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284438#comment-15284438
 ] 

Sean Owen commented on SPARK-15344:
---

Comment on SPARK-14881 then, maybe? This sounds like a duplicate or closely 
related.
CC [~felixcheung]

> Unable to set default log level for PySpark
> ---
>
> Key: SPARK-15344
> URL: https://issues.apache.org/jira/browse/SPARK-15344
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> After this patch:
> https://github.com/apache/spark/pull/12648
> I'm unable to set the default log level for PySpark.
> It's always WARN.
> The setting below doesn't work: 
> {code}
> mbrynski@jupyter:~/spark$ cat conf/log4j.properties
> # Set everything to be logged to the console
> log4j.rootCategory=INFO, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Set the default spark-shell log level to WARN. When running the 
> spark-shell, the
> # log level for this class is used to overwrite the root logger's log level, 
> so that
> # the user can have different defaults for the shell and regular Spark apps.
> log4j.logger.org.apache.spark.repl.Main=INFO
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark_project.jetty=WARN
> log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15345) Cannot connect to Hive databases

2016-05-16 Thread Piotr Milanowski (JIRA)
Piotr Milanowski created SPARK-15345:


 Summary: Cannot connect to Hive databases
 Key: SPARK-15345
 URL: https://issues.apache.org/jira/browse/SPARK-15345
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Piotr Milanowski


I am working with branch-2.0; Spark is compiled with Hive support (-Phive and 
-Phive-thriftserver).
I am trying to access databases using this snippet:
{code}
from pyspark.sql import HiveContext
hc = HiveContext(sc)
hc.sql("show databases").collect()
[Row(result='default')]
{code}

This means that Spark doesn't find any of the databases specified in the 
configuration.
Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 
1.6 and launching the above snippet, I can print out the existing databases.
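
A cross-check that may help narrow this down: the Scala equivalent on 
branch-2.0, with Hive support enabled explicitly (a sketch, assuming 
hive-site.xml is on the classpath):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch: build a session with Hive support and list the databases the catalog sees.
val spark = SparkSession.builder()
  .appName("hive-db-check")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
{code}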

When run in DEBUG mode this is what spark (2.0) prints out:

{code}
16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases
16/05/16 12:17:47 DEBUG SimpleAnalyzer: 
=== Result of Batch Resolution ===
!'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, 
string])) null else input[0, string].toString, 
StructField(result,StringType,false)), result#2) AS #3]   Project 
[createexternalrow(if (isnull(result#2)) null else result#2.toString, 
StructField(result,StringType,false)) AS #3]
 +- LocalRelation [result#2]

 +- LocalRelation [result#2]

16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
(org.apache.spark.sql.Dataset$$anonfun$53) +++
16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID
16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
org.apache.spark.sql.types.StructType 
org.apache.spark.sql.Dataset$$anonfun$53.structType$1
16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object)
16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow)
16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
this is the starting closure
16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting closure: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
(org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++
16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
(org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) 
+++
16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 1
16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID
16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared methods: 2
16/05/16 12:17:47 DEBUG ClosureCleaner:  public final java.lang.Object 
org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object)
16/05/16 12:17:47 DEBUG ClosureCleaner:  public final 
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler 
org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator)
16/05/16 12:17:47 DEBUG ClosureCleaner:  + inner classes: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer classes: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + outer objects: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + populating accessed fields because 
this is the starting closure
16/05/16 12:17:47 DEBUG ClosureCleaner:  + fields accessed by starting closure: 0
16/05/16 12:17:47 DEBUG ClosureCleaner:  + there are no enclosing objects!
16/05/16 12:17:47 DEBUG ClosureCleaner:  +++ closure  
(org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) 
is now cleaned +++
16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure  
(org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13) +++
16/05/16 12:17:47 DEBUG ClosureCleaner:  + declared fields: 2
16/05/16 12:17:47 DEBUG ClosureCleaner:  public static final long 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.serialVersionUID
16/05/16 12:17:47 DEBUG ClosureCleaner:  private final 
org.apache.spark.rdd.RDD$$anonfun$collect$1 
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.$outer
16/05/16 12:17:47 DEBUG 

[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN

2016-05-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284433#comment-15284433
 ] 

Maciej Bryński commented on SPARK-15343:


CC: [~vanzin]


> NoClassDefFoundError when initializing Spark with YARN
> --
>
> Key: SPARK-15343
> URL: https://issues.apache.org/jira/browse/SPARK-15343
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop.
> Spark compiled with:
> {code}
> ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver 
> -Dhadoop.version=2.6.0 -DskipTests
> {code}
> I'm getting the following error:
> {code}
> mbrynski@jupyter:~/spark$ bin/pyspark
> Python 3.4.0 (default, Apr 11 2014, 13:05:11)
> [GCC 4.8.2] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" 
> with specified deploy mode instead.
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has 
> been deprecated as of Spark 2.0 and may be removed in the future. Please use 
> the new key 'spark.yarn.jars' instead.
> 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/05/16 11:54:42 WARN AbstractHandler: No Server set for 
> org.spark_project.jetty.server.handler.ErrorHandler@f7989f6
> 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads 
> feature cannot be used because libhadoop cannot be loaded.
> Traceback (most recent call last):
>   File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in 
> sc = SparkContext()
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__
> conf, jsc, profiler_cls)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init
> self._jsc = jsc or self._initialize_context(self._conf._jconf)
>   File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in 
> _initialize_context
> return self._jvm.JavaSparkContext(jconf)
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 1183, in __call__
>   File 
> "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 
> 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> None.org.apache.spark.api.java.JavaSparkContext.
> : java.lang.NoClassDefFoundError: 
> com/sun/jersey/api/client/config/ClientConfig
> at 
> org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45)
> at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150)
> at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148)
> at org.apache.spark.SparkContext.(SparkContext.scala:502)
> at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:236)
> at 
> py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
> at 
> py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.ClassNotFoundException: 
> com.sun.jersey.api.client.config.ClientConfig
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 19 more
> {code}
> On 1.6 everything 

[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks

2016-05-16 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284432#comment-15284432
 ] 

Sean Owen commented on SPARK-15247:
---

Did you actually open a PR for this?

> sqlCtx.read.parquet yields at least n_executors * n_cores tasks
> ---
>
> Key: SPARK-15247
> URL: https://issues.apache.org/jira/browse/SPARK-15247
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Johnny W.
>
> sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even 
> though it is reading only one very small file.
> This issue can increase the latency of small jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


