[jira] [Comment Edited] (SPARK-15227) InputStream stop-start semantics + empty implementations
[ https://issues.apache.org/jira/browse/SPARK-15227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286089#comment-15286089 ] Prashant Sharma edited comment on SPARK-15227 at 5/17/16 5:44 AM: -- If start and stop are overridden by a particular DStream, they are called when the stream is started (to do some initialization) and stopped (to do some cleanup). However, if there is nothing to initialize or clean up, they can be left empty. Pause and resume are very different from start and stop. For example, if you pause, what happens to the incoming stream? Is it buffered or dropped? Those semantics need to be discussed before we can talk about that. It is possible to implement this with a custom receiver. Since development effort has shifted towards Structured Streaming, it will be interesting to see how this sort of thing gets implemented.
> InputStream stop-start semantics + empty implementations > > > Key: SPARK-15227 > URL: https://issues.apache.org/jira/browse/SPARK-15227 > Project: Spark > Issue Type: Improvement > Components: Input/Output, Streaming >Affects Versions: 1.6.1 >Reporter: Stas Levin >Priority: Minor > > Hi, > Seems like quite a few InputStream(s) currently leave the start and stop > methods empty. > I was hoping to hear your thoughts on: > 1. Whether there were any particular reasons to leave these methods empty ? > 2. Do the stop/start semantics of InputStream(s) aim to support pause-resume > use-cases, or is it a one way ticket? > A pause-resume kind of thing could be really useful for cases where one > wishes to load new offline data for the streaming app to run on top of, > without restarting the entire app. > Thanks a lot, > Stas -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
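To make the buffering question above concrete, here is a small, Spark-free sketch. The class and method names are hypothetical (this is not Spark's Receiver API); it just illustrates one possible pause-resume semantics: pause buffers incoming elements rather than dropping them, and resume flushes the buffer downstream.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical illustration only -- not Spark code. It shows one possible
// answer to "what happens to the incoming stream on pause": elements are
// buffered while paused, then flushed downstream on resume.
class PausableSource {
    private final List<String> downstream = new ArrayList<>();
    private final Queue<String> buffer = new ArrayDeque<>();
    private boolean started = false;
    private boolean paused = false;

    void start() { started = true; }                    // one-time initialization
    void stop()  { started = false; buffer.clear(); }   // cleanup

    void pause()  { paused = true; }
    void resume() {                                     // flush buffered elements
        paused = false;
        while (!buffer.isEmpty()) downstream.add(buffer.poll());
    }

    void onData(String element) {
        if (!started) return;                           // dropped before start
        if (paused) buffer.add(element);                // buffered, not dropped
        else downstream.add(element);
    }

    List<String> delivered() { return downstream; }
}
```

The alternative semantics (drop while paused) would simply skip the buffer; which of the two is wanted is exactly the discussion the comment above calls for.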
[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286039#comment-15286039 ] Felix Cheung commented on SPARK-15344: -- This was the original change: https://issues.apache.org/jira/browse/SPARK-11929 > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
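As an illustration of the reported behavior (a programmatic root-logger override beating the properties-file default), the same inheritance effect can be shown with plain java.util.logging. This is only an analogy to demonstrate the mechanism, not Spark's actual log4j code path:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Analogy for the reported behavior: once startup code sets the root
// logger's level programmatically (as the shells do with WARN), a config
// file's more verbose default no longer takes effect for child loggers
// that inherit their level from the root.
public class RootOverrideDemo {
    public static boolean infoVisibleAfterOverride() {
        Logger root = Logger.getLogger("");
        root.setLevel(Level.WARNING);       // programmatic override, like the shell's WARN default
        Logger app = Logger.getLogger("org.example.app");
        app.setLevel(null);                 // inherit from root, like an unconfigured logger
        return app.isLoggable(Level.INFO);  // false: INFO is suppressed despite any file default
    }
}
```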
[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286038#comment-15286038 ] Felix Cheung commented on SPARK-15344: -- SPARK-14881 was to get the pyspark and sparkR shells to match the new default behavior of spark-shell (Scala). As you can see here, it will always set the default to WARN: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L135 I agree it makes sense that if log4j-defaults.properties is there, we should keep the log level set there, for all shell/REPL cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13850) TimSort Comparison method violates its general contract
[ https://issues.apache.org/jira/browse/SPARK-13850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285980#comment-15285980 ] Yin Huai commented on SPARK-13850: -- Can you explain the root cause here? > TimSort Comparison method violates its general contract > --- > > Key: SPARK-13850 > URL: https://issues.apache.org/jira/browse/SPARK-13850 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.0 >Reporter: Sital Kedia > > While running a query which does a group by on a large dataset, the query > fails with the following stack trace. > {code} > Job aborted due to stage failure: Task 4077 in stage 1.3 failed 4 times, most > recent failure: Lost task 4077.3 in stage 1.3 (TID 88702, > hadoop3030.prn2.facebook.com): java.lang.IllegalArgumentException: Comparison > method violates its general contract! > at > org.apache.spark.util.collection.TimSort$SortState.mergeLo(TimSort.java:794) > at > org.apache.spark.util.collection.TimSort$SortState.mergeAt(TimSort.java:525) > at > org.apache.spark.util.collection.TimSort$SortState.mergeCollapse(TimSort.java:453) > at > org.apache.spark.util.collection.TimSort$SortState.access$200(TimSort.java:325) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:153) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:228) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:186) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:175) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249) > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:318) > at > 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:333) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Please note that the same query used to succeed in Spark 1.5 so it seems like > a regression in 1.6. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
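For context on the exception itself: TimSort raises "Comparison method violates its general contract!" when a comparator is internally inconsistent. A classic standalone example, unrelated to whatever Spark's actual root cause turns out to be here, is subtraction-based integer comparison, which overflows and breaks transitivity:

```java
import java.util.Comparator;

// A comparator that looks reasonable but violates the Comparator contract:
// subtraction overflows for arguments far apart, so the induced ordering is
// not transitive. On large inputs, TimSort (Arrays.sort for objects) can
// detect the inconsistency and throw IllegalArgumentException.
public class BrokenComparatorDemo {
    static final Comparator<Integer> BROKEN = (a, b) -> a - b; // overflow-prone

    public static boolean isTransitiveOnSample() {
        int min = Integer.MIN_VALUE;
        boolean minLessThanZero = BROKEN.compare(min, 0) < 0;  // true
        boolean zeroLessThanOne = BROKEN.compare(0, 1) < 0;    // true
        boolean minLessThanOne  = BROKEN.compare(min, 1) < 0;  // false: MIN_VALUE - 1 wraps positive
        // Transitivity requires: (min < 0 && 0 < 1) implies min < 1.
        return !(minLessThanZero && zeroLessThanOne) || minLessThanOne;
    }
}
```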
[jira] [Updated] (SPARK-15292) ML 2.0 QA: Scala APIs audit for classification
[ https://issues.apache.org/jira/browse/SPARK-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15292: -- Assignee: Yanbo Liang Target Version/s: 2.0.0 > ML 2.0 QA: Scala APIs audit for classification > -- > > Key: SPARK-15292 > URL: https://issues.apache.org/jira/browse/SPARK-15292 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Audit the Scala API for classification; almost all issues were related to > MultilayerPerceptronClassifier. > * Fix one wrong param getter/setter method: getOptimizer -> getSolver > * Add missing setters for "solver" and "stepSize". > * Make the GD solver take effect. > * Update docs, annotations and fix other minor issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15269) Creating external table leaves empty directory under warehouse directory
[ https://issues.apache.org/jira/browse/SPARK-15269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-15269: --- Assignee: Xin Wu > Creating external table leaves empty directory under warehouse directory > > > Key: SPARK-15269 > URL: https://issues.apache.org/jira/browse/SPARK-15269 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Xin Wu > > Adding the following test case in {{HiveDDLSuite}} may reproduce this issue: > {code} > test("foo") { > withTempPath { dir => > val path = dir.getCanonicalPath > spark.range(1).write.json(path) > withTable("ddl_test1") { > sql(s"CREATE TABLE ddl_test1 USING json OPTIONS (PATH '$path')") > sql("DROP TABLE ddl_test1") > sql(s"CREATE TABLE ddl_test1 USING json AS SELECT 1 AS a") > } > } > } > {code} > Note that the first {{CREATE TABLE}} command creates an external table since > data source tables are always external when {{PATH}} option is specified. 
> When executing the second {{CREATE TABLE}} command, which creates a managed > table with the same name, it fails because there's already an unexpected > directory with the same name as the table name in the warehouse directory: > {noformat} > [info] - foo *** FAILED *** (7 seconds, 649 milliseconds) > [info] org.apache.spark.sql.AnalysisException: path > file:/Users/lian/local/src/spark/workspace-b/target/tmp/warehouse-205e25e7-8918-4615-acf1-10e06af7c35c/ddl_test1 > already exists.; > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:88) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:417) > [info] at > org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:231) > [info] at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > [info] at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > [info] at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > [info] at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > [info] at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > [info] at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > [info] at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:186) > [info] at org.apache.spark.sql.Dataset.(Dataset.scala:167) > [info] at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62) > [info] at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541) > [info] at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59) > [info] at > org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:59) > [info] at >
[jira] [Updated] (SPARK-15357) Cooperative spilling should check consumer memory mode
[ https://issues.apache.org/jira/browse/SPARK-15357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-15357: -- Description: In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (...) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force tungsten consumers to spill and then NOT use the freed memory. A better way to do this is to incorporate the memory mode into the consumer itself and spill only those with matching memory modes.
> Cooperative spilling should check consumer memory mode > -- > > Key: SPARK-15357 > URL: https://issues.apache.org/jira/browse/SPARK-15357 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Andrew Or > > In TaskMemoryManager.java:
> {code}
> for (MemoryConsumer c: consumers) {
>   if (c != consumer && c.getUsed() > 0) {
>     try {
>       long released = c.spill(required - got, consumer);
>       if (released > 0 && mode == tungstenMemoryMode) {
>         got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
>         if (got >= required) {
>           break;
>         }
>       }
>     } catch (...) { ... }
>   }
> }
> {code}
> Currently, when non-tungsten consumers acquire execution memory, they may > force tungsten consumers to spill and then NOT use the freed memory. A > better way to do this is to incorporate the memory mode in the consumer > itself and spill only those with matching memory modes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
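The proposal in the description can be sketched outside Spark. The classes below are simplified, hypothetical stand-ins for TaskMemoryManager and MemoryConsumer (not the real APIs); the point is the mode check in the spill loop, which ensures memory freed by spilling is actually usable by the requesting consumer:

```java
import java.util.List;

// Simplified stand-ins sketching the proposed fix: spill only consumers
// whose memory mode matches the requester's. Without the mode check, an
// on-heap consumer could be forced to spill to satisfy an off-heap request,
// freeing memory the requester cannot use.
public class SpillSketch {
    enum Mode { ON_HEAP, OFF_HEAP }

    static class Consumer {
        final Mode mode;
        long used;
        Consumer(Mode mode, long used) { this.mode = mode; this.used = used; }
        long spill(long needed) {                 // free up to `needed` bytes
            long freed = Math.min(used, needed);
            used -= freed;
            return freed;
        }
    }

    static long acquireWithSpill(List<Consumer> consumers, Consumer requester, long required) {
        long got = 0;
        for (Consumer c : consumers) {
            if (got >= required) break;
            // The mode check is the fix: never force a consumer in the other
            // memory mode to spill, since that freed memory can't be reused.
            if (c != requester && c.used > 0 && c.mode == requester.mode) {
                got += c.spill(required - got);
            }
        }
        return got;
    }
}
```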
[jira] [Created] (SPARK-15357) Cooperative spilling should check consumer memory mode
Andrew Or created SPARK-15357: - Summary: Cooperative spilling should check consumer memory mode Key: SPARK-15357 URL: https://issues.apache.org/jira/browse/SPARK-15357 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Andrew Or In TaskMemoryManager.java:
{code}
for (MemoryConsumer c: consumers) {
  if (c != consumer && c.getUsed() > 0) {
    try {
      long released = c.spill(required - got, consumer);
      if (released > 0 && mode == tungstenMemoryMode) {
        logger.debug("Task {} released {} from {} for {}", taskAttemptId, Utils.bytesToString(released), c, consumer);
        got += memoryManager.acquireExecutionMemory(required - got, taskAttemptId, mode);
        if (got >= required) {
          break;
        }
      }
    } catch (IOException e) { ... }
  }
}
{code}
Currently, when non-tungsten consumers acquire execution memory, they may force tungsten consumers to spill and then NOT use the freed memory. A better way to do this is to incorporate the memory mode into the consumer itself and spill only those with matching memory modes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14752) LazilyGenerateOrdering throws NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-14752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285675#comment-15285675 ] Apache Spark commented on SPARK-14752: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/13141 > LazilyGenerateOrdering throws NullPointerException > -- > > Key: SPARK-14752 > URL: https://issues.apache.org/jira/browse/SPARK-14752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Rajesh Balamohan > > codebase: spark master > DataSet: TPC-DS > Client: $SPARK_HOME/bin/beeline > Example query to reproduce the issue: > select i_item_id from item order by i_item_id limit 10; > Explain plan output > {noformat} > explain select i_item_id from item order by i_item_id limit 10; > +--+--+ > | > plan > > | > +--+--+ > | == Physical Plan == > TakeOrderedAndProject(limit=10, orderBy=[i_item_id#1229 ASC], > output=[i_item_id#1229]) > +- WholeStageCodegen >: +- Project [i_item_id#1229] >: +- Scan HadoopFiles[i_item_id#1229] Format: ORC, PushedFilters: [], > ReadSchema: struct | > +--+--+ > {noformat} > Exception: > {noformat} > TaskResultGetter: Exception while getting task result > com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException > Serialization trace: > underlying (org.apache.spark.util.BoundedPriorityQueue) > at > com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:551) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:25) > at com.twitter.chill.SomeSerializer.read(SomeSerializer.scala:19) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:790) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:312) > at > org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:87) > at > 
org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:66) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1791) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:157) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.compare(GenerateOrdering.scala:148) > at scala.math.Ordering$$anon$4.compare(Ordering.scala:111) > at java.util.PriorityQueue.siftUpUsingComparator(PriorityQueue.java:669) > at java.util.PriorityQueue.siftUp(PriorityQueue.java:645) > at java.util.PriorityQueue.offer(PriorityQueue.java:344) > at java.util.PriorityQueue.add(PriorityQueue.java:321) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:78) > at > com.twitter.chill.java.PriorityQueueSerializer.read(PriorityQueueSerializer.java:31) > at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:708) > at >
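The failure pattern in this stack trace, a PriorityQueue being rebuilt element by element while its comparator is not yet usable, can be reproduced standalone. The comparator below delegates to a lazily initialized field that is still null; this is an analogy to LazilyGeneratedOrdering's state during Kryo deserialization, not Spark's actual code:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Mimics the failure mode: a comparator whose real ordering is created
// lazily. If elements are inserted (as Kryo's PriorityQueueSerializer does
// when rebuilding the queue) before the ordering exists, the second
// insertion triggers a comparison and throws NullPointerException.
public class LazyOrderingDemo {
    static class LazyComparator implements Comparator<Integer> {
        Comparator<Integer> underlying = null;   // not yet initialized
        public int compare(Integer a, Integer b) {
            return underlying.compare(a, b);     // NPE while still null
        }
    }

    public static boolean insertionThrowsNpe() {
        PriorityQueue<Integer> q = new PriorityQueue<>(new LazyComparator());
        q.add(1);                 // no comparison needed for the first element
        try {
            q.add(2);             // siftUp compares -> NullPointerException
            return false;
        } catch (NullPointerException expected) {
            return true;
        }
    }
}
```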
[jira] [Commented] (SPARK-14817) ML, Graph, R 2.0 QA: Programming guide update and migration guide
[ https://issues.apache.org/jira/browse/SPARK-14817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285626#comment-15285626 ] Joseph K. Bradley commented on SPARK-14817: --- Migration guide needs to note change from [SPARK-14814]'s PR > ML, Graph, R 2.0 QA: Programming guide update and migration guide > - > > Key: SPARK-14817 > URL: https://issues.apache.org/jira/browse/SPARK-14817 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib, SparkR >Reporter: Joseph K. Bradley > > Before the release, we need to update the MLlib, GraphX, and SparkR > Programming Guides. Updates will include: > * Add migration guide subsection. > ** Use the results of the QA audit JIRAs and [SPARK-13448]. > * Check phrasing, especially in main sections (for outdated items such as "In > this release, ...") > For MLlib, we will make the DataFrame-based API (spark.ml) front-and-center, > to make it clear the RDD-based API is the older, maintenance-mode one. > * No docs for spark.mllib will be deleted; they will just be reorganized and > put in a subsection. > * If spark.ml docs are less complete, or if spark.ml docs say "refer to the > spark.mllib docs for details," then we should copy those details to the > spark.ml docs. This per-feature work can happen under [SPARK-14815]. > * This big reorganization should be done *after* docs are added for each > feature (to minimize merge conflicts). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14814) ML 2.0 QA: API: Java compatibility, docs
[ https://issues.apache.org/jira/browse/SPARK-14814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14814. --- Resolution: Fixed Fix Version/s: 2.0.0 Given your review + the Java fix, I'll mark this as done. Thanks! > ML 2.0 QA: API: Java compatibility, docs > > > Key: SPARK-14814 > URL: https://issues.apache.org/jira/browse/SPARK-14814 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Java API, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: yuhao yang > Fix For: 2.0.0 > > > Check Java compatibility for MLlib for this release. > Checking compatibility means: > * comparing with the Scala doc > * verifying that Java docs are not messed up by Scala type incompatibilities. > Some items to look out for are: > ** Check for generic "Object" types where Java cannot understand complex > Scala types. > *** *Note*: The Java docs do not always match the bytecode. If you find a > problem, please verify it using {{javap}}. > ** Check Scala objects (especially with nesting!) carefully. > ** Check for uses of Scala and Java enumerations, which can show up oddly in > the other language's doc. > * If needed for complex issues, create small Java unit tests which execute > each method. (The correctness can be checked in Scala.) > If you find issues, please comment here, or for larger items, create separate > JIRAs and link here as "requires". > Note that we should not break APIs from previous releases. So if you find a > problem, check if it was introduced in this Spark release (in which case we > can fix it) or in a previous one (in which case we can create a java-friendly > version of the API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285618#comment-15285618 ] praveen dareddy commented on SPARK-15194: - [~josephkb] Thanks for clarifying this. I will continue work on this issue once the blocker issue SPARK-14906 is merged to the master. Thanks, praveen > Add Python ML API for MultivariateGaussian > -- > > Key: SPARK-15194 > URL: https://issues.apache.org/jira/browse/SPARK-15194 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > We have a PySpark API for the MLLib version but not the ML version. This > would allow Python's `GaussianMixture` to more closely match the Scala API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14810) ML, Graph 2.0 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-14810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285613#comment-15285613 ] Joseph K. Bradley commented on SPARK-14810: --- [~nick.pentre...@gmail.com] Thanks! Your judgements sound correct to me. To document the changes, I like to list them in the migration guide, grouped by whether they are breaking changes, removed deprecated items, behavior changes, etc. By the way, can you please not put items specific to this release in the JIRA description? It makes things easier if we can clone these QA JIRAs for each new release and minimize the editing needed. Feel free to update the instructions, though. > ML, Graph 2.0 QA: API: Binary incompatible changes > -- > > Key: SPARK-14810 > URL: https://issues.apache.org/jira/browse/SPARK-14810 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Nick Pentreath > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. 
> List of changes since {{1.6.0}} audited - these are "false positives" due to > being private, @Experimental, DeveloperAPI, etc: > * SPARK-13686 - Add a constructor parameter `regParam` to > (Streaming)LinearRegressionWithSGD > * SPARK-13664 - Replace HadoopFsRelation with FileFormat > * SPARK-11622 - Make LibSVMRelation extends HadoopFsRelation and Add > LibSVMOutputWriter > * SPARK-13920 - MIMA checks should apply to @Experimental and @DeveloperAPI > APIs > * SPARK-11011 - UserDefinedType serialization should be strongly typed > * SPARK-13817 - Re-enable MiMA and removes object DataFrame > * SPARK-13927 - add row/column iterator to local matrices - (add methods to > sealed trait) > * SPARK-13948 - MiMa Check should catch if the visibility change to `private` > - (DataFrame -> Dataset) > * SPARK-11262 - Unit test for gradient, loss layers, memory management - > (private class) > * SPARK-13430 - moved featureCol from LinearRegressionModelSummary to > LinearRegressionSummary - (private class) > * SPARK-13048 - keepLastCheckpoint option for LDA EM optimizer - (private > class) > * SPARK-14734 - Add conversions between mllib and ml Vector, Matrix types - > (private methods added) > * SPARK-14861 - Replace internal usages of SQLContext with SparkSession - > (private class) > Binary incompatible changes: > * SPARK-14089 - Remove methods that has been deprecated since 1.1, 1.2, 1.3, > 1.4, and 1.5 > * SPARK-14952 - Remove methods deprecated in 1.6 > * DataFrame -> Dataset changes for Java (this of course applies for all > of Spark SQL) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285608#comment-15285608 ] Joseph K. Bradley commented on SPARK-7424: -- I'm retargeting for 2.1 since we need to focus on QA now. > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column
[ https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7424: - Target Version/s: 2.1.0 (was: 2.0.0) > spark.ml classification, regression abstractions should add metadata to > output column > - > > Key: SPARK-7424 > URL: https://issues.apache.org/jira/browse/SPARK-7424 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > Update ClassificationModel, ProbabilisticClassificationModel prediction to > include numClasses in output column metadata. > Update RegressionModel to specify output column metadata as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-15356) AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number
[ https://issues.apache.org/jira/browse/SPARK-15356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin deleted SPARK-15356: --- > AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech > Support Phone Number > > > Key: SPARK-15356 > URL: https://issues.apache.org/jira/browse/SPARK-15356 > Project: Spark > Issue Type: Bug > Environment: AOL Customer Care Number @ 1800.545.7482 Help Desk > Number & AOL MAIL Tech Support Phone Number >Reporter: lola pola
[jira] [Updated] (SPARK-15328) Word2Vec import for original binary format
[ https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15328: -- Priority: Minor (was: Major) > Word2Vec import for original binary format > -- > > Key: SPARK-15328 > URL: https://issues.apache.org/jira/browse/SPARK-15328 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yuming Wang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15328) Word2Vec import for original binary format
[ https://issues.apache.org/jira/browse/SPARK-15328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15328: -- Component/s: (was: MLlib) > Word2Vec import for original binary format > -- > > Key: SPARK-15328 > URL: https://issues.apache.org/jira/browse/SPARK-15328 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Yuming Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15356) AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number
lola pola created SPARK-15356: - Summary: AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number Key: SPARK-15356 URL: https://issues.apache.org/jira/browse/SPARK-15356 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.6.1 Environment: AOL Customer Care Number @ 1800.545.7482 Help Desk Number & AOL MAIL Tech Support Phone Number Reporter: lola pola
[jira] [Updated] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc
[ https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15254: -- Component/s: Documentation > Improve ML pipeline Cross Validation Scaladoc & PyDoc > - > > Key: SPARK-15254 > URL: https://issues.apache.org/jira/browse/SPARK-15254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Priority: Minor > > The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse - we should > fill this out with a more concrete description. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15254) Improve ML pipeline Cross Validation Scaladoc & PyDoc
[ https://issues.apache.org/jira/browse/SPARK-15254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15254: -- Issue Type: Documentation (was: Improvement) > Improve ML pipeline Cross Validation Scaladoc & PyDoc > - > > Key: SPARK-15254 > URL: https://issues.apache.org/jira/browse/SPARK-15254 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: holdenk >Priority: Minor > > The ML pipeline Cross Validation Scaladoc & PyDoc is very sparse - we should > fill this out with a more concrete description. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15194) Add Python ML API for MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-15194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285595#comment-15285595 ] Joseph K. Bradley commented on SPARK-15194: --- This should be implemented using numpy, within mllib-local, as [~holdenk] said. But you'll need to wait until the blocker JIRA is done. > Add Python ML API for MultivariateGaussian > -- > > Key: SPARK-15194 > URL: https://issues.apache.org/jira/browse/SPARK-15194 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: holdenk >Priority: Minor > > We have a PySpark API for the MLLib version but not the ML version. This > would allow Python's `GaussianMixture` to more closely match the Scala API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
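Following on the numpy suggestion above, a minimal sketch of what a numpy-backed density computation could look like. The class name, constructor, and {{pdf}} signature here are hypothetical, for illustration only, not the eventual Spark API:

```python
import numpy as np

class MultivariateGaussian:
    """Illustrative sketch of a numpy-backed multivariate Gaussian.

    Names and API are hypothetical, not the actual Spark mllib-local API.
    """

    def __init__(self, mean, cov):
        self.mean = np.asarray(mean, dtype=float)
        self.cov = np.asarray(cov, dtype=float)
        # Precompute the factors reused by every pdf() call.
        self._dim = self.mean.shape[0]
        self._cov_inv = np.linalg.inv(self.cov)
        self._norm = 1.0 / np.sqrt(
            ((2 * np.pi) ** self._dim) * np.linalg.det(self.cov))

    def pdf(self, x):
        # Density of x under N(mean, cov).
        delta = np.asarray(x, dtype=float) - self.mean
        return self._norm * np.exp(-0.5 * delta @ self._cov_inv @ delta)

# Standard 2-D normal: density at the origin is 1 / (2 * pi).
g = MultivariateGaussian([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(round(g.pdf([0.0, 0.0]), 6))
```

A production version would want a Cholesky factorization instead of an explicit inverse, but the shape of the API is the point here.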
[jira] [Updated] (SPARK-15164) Mark classification algorithms as experimental where marked so in scala
[ https://issues.apache.org/jira/browse/SPARK-15164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15164: -- Target Version/s: 2.0.0 > Mark classification algorithms as experimental where marked so in scala > --- > > Key: SPARK-15164 > URL: https://issues.apache.org/jira/browse/SPARK-15164 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15145) port binary classification evaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285577#comment-15285577 ] Joseph K. Bradley commented on SPARK-15145: --- [~wm624] Can you please update this JIRA title and description? (The evaluator already is in spark.ml; this needs to be more specific.) Also, please update the PR. Thanks! > port binary classification evaluator to spark.ml > > > Key: SPARK-15145 > URL: https://issues.apache.org/jira/browse/SPARK-15145 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Miao Wang > > As we discussed in #12922, binary classification evaluator should be ported > from mllib to spark.ml after 2.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15145) port binary classification evaluator to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285577#comment-15285577 ] Joseph K. Bradley edited comment on SPARK-15145 at 5/16/16 10:45 PM: - [~wm624] Can you please update this JIRA title and description? (The evaluator already is in spark.ml; this needs to be more specific.) Also, please update the PR title & description too. Thanks! was (Author: josephkb): [~wm624] Can you please update this JIRA title and description? (The evaluator already is in spark.ml; this needs to be more specific.) Also, please update the PR. Thanks! > port binary classification evaluator to spark.ml > > > Key: SPARK-15145 > URL: https://issues.apache.org/jira/browse/SPARK-15145 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Miao Wang > > As we discussed in #12922, binary classification evaluator should be ported > from mllib to spark.ml after 2.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15355) Pro-active block replenishment in case of node/executor failures
Shubham Chopra created SPARK-15355: -- Summary: Pro-active block replenishment in case of node/executor failures Key: SPARK-15355 URL: https://issues.apache.org/jira/browse/SPARK-15355 Project: Spark Issue Type: Sub-task Components: Block Manager, Spark Core Reporter: Shubham Chopra Spark currently does not replenish lost replicas. For resiliency and high availability, BlockManagerMasterEndpoint can proactively verify whether all cached RDDs have enough replicas, and replenish them, in case they don’t. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
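The verification step described above can be sketched roughly as follows; the function name and the shape of the location map are hypothetical, purely to illustrate the replica-count check a master endpoint could run:

```python
def find_blocks_to_replenish(block_locations, target_replicas):
    """Return blocks whose live replica count is below target.

    block_locations: dict mapping block_id -> set of executor ids that
    currently hold a replica (a hypothetical shape, for illustration;
    not the actual BlockManagerMasterEndpoint bookkeeping).
    Returns a dict of block_id -> number of missing replicas.
    """
    return {
        block_id: target_replicas - len(executors)
        for block_id, executors in block_locations.items()
        if len(executors) < target_replicas
    }

# After an executor failure, its replicas disappear from the map,
# and the periodic check reports what needs to be re-replicated:
locations = {"rdd_0_0": {"exec-1", "exec-2"}, "rdd_0_1": {"exec-2"}}
print(find_blocks_to_replenish(locations, target_replicas=2))
```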
[jira] [Created] (SPARK-15354) Topology aware block replication strategies
Shubham Chopra created SPARK-15354: -- Summary: Topology aware block replication strategies Key: SPARK-15354 URL: https://issues.apache.org/jira/browse/SPARK-15354 Project: Spark Issue Type: Sub-task Components: Mesos, Spark Core, YARN Reporter: Shubham Chopra Implementations of strategies for resilient block replication for different resource managers that replicate the 3-replica strategy used by HDFS, where the first replica is on an executor, the second replica within the same rack as the executor and a third replica on a different rack. The implementation involves providing two pluggable classes, one running in the driver that provides topology information for every host at cluster start and the second prioritizing a list of peer BlockManagerIds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
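The HDFS-style placement described above can be sketched as follows; executor and rack names are made up, and this illustrates only the policy, not Spark code:

```python
def place_replicas(first_executor, peers, racks):
    """HDFS-style 3-replica placement: first replica on the writing
    executor, second on a peer in the same rack, third on a peer in a
    different rack. `racks` maps executor id -> rack name (all names
    here are hypothetical)."""
    same_rack = [p for p in peers if racks[p] == racks[first_executor]]
    other_rack = [p for p in peers if racks[p] != racks[first_executor]]
    placement = [first_executor]
    if same_rack:
        placement.append(same_rack[0])
    if other_rack:
        placement.append(other_rack[0])
    return placement

racks = {"e1": "rackA", "e2": "rackA", "e3": "rackB"}
print(place_replicas("e1", ["e2", "e3"], racks))  # ['e1', 'e2', 'e3']
```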
[jira] [Updated] (SPARK-15353) Making peer selection for block replication pluggable
[ https://issues.apache.org/jira/browse/SPARK-15353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Chopra updated SPARK-15353: --- Attachment: BlockManagerSequenceDiagram.png Sequence diagram explaining the various calls between BlockManager and BlockManagerMasterEndpoint for topology aware block replication > Making peer selection for block replication pluggable > - > > Key: SPARK-15353 > URL: https://issues.apache.org/jira/browse/SPARK-15353 > Project: Spark > Issue Type: Sub-task > Components: Block Manager, Spark Core >Reporter: Shubham Chopra > Attachments: BlockManagerSequenceDiagram.png > > > BlockManagers running on executors provide all logistics around block > management. Before a BlockManager can be used, it has to be “initialized”. As > a part of the initialization, BlockManager asks the > BlockManagerMasterEndpoint to give it topology information. The > BlockManagerMasterEndpoint is provided a pluggable interface that can be used > to resolve a hostname to topology. This information is used to decorate the > BlockManagerId. This happens at cluster start and whenever a new executor is > added. > During replication, the BlockManager gets the list of all its peers in the > form of a Seq[BlockManagerId]. We add a pluggable prioritizer that can be > used to prioritize this list of peers based on topology information. Peers > with higher priority occur first in the sequence and the BlockManager tries > to replicate blocks in that order. > There would be default implementations for these pluggable interfaces that > replicate the existing behavior of randomly choosing a peer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285539#comment-15285539 ] Sean Owen commented on SPARK-3785: -- That, and things like YARN labels are indeed a pre-requisite to be able to target work at machines with a GPU. Those are already done. But this is about doing something in Spark to off-load something to a GPU. It doesn't actually require Spark's support any further; already works. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to adding support for off-loading computations to the > GPU, e.g. via an open-cl binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15353) Making peer selection for block replication pluggable
Shubham Chopra created SPARK-15353: -- Summary: Making peer selection for block replication pluggable Key: SPARK-15353 URL: https://issues.apache.org/jira/browse/SPARK-15353 Project: Spark Issue Type: Sub-task Components: Block Manager, Spark Core Reporter: Shubham Chopra BlockManagers running on executors provide all logistics around block management. Before a BlockManager can be used, it has to be “initialized”. As a part of the initialization, BlockManager asks the BlockManagerMasterEndpoint to give it topology information. The BlockManagerMasterEndpoint is provided a pluggable interface that can be used to resolve a hostname to topology. This information is used to decorate the BlockManagerId. This happens at cluster start and whenever a new executor is added. During replication, the BlockManager gets the list of all its peers in the form of a Seq[BlockManagerId]. We add a pluggable prioritizer that can be used to prioritize this list of peers based on topology information. Peers with higher priority occur first in the sequence and the BlockManager tries to replicate blocks in that order. There would be default implementations for these pluggable interfaces that replicate the existing behavior of randomly choosing a peer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
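The pluggable prioritizer with a random default, as described above, can be sketched like this. This is an illustrative Python rendering of the Scala interface being proposed; all class and parameter names are hypothetical:

```python
import random

class BlockReplicationPrioritizer:
    """Pluggable peer-prioritization hook (illustrative sketch only)."""
    def prioritize(self, peers):
        raise NotImplementedError

class RandomPrioritizer(BlockReplicationPrioritizer):
    """Default behavior: shuffle the peers, matching the existing
    random peer choice."""
    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def prioritize(self, peers):
        shuffled = list(peers)
        self._rng.shuffle(shuffled)
        return shuffled

class RackAwarePrioritizer(BlockReplicationPrioritizer):
    """Put peers on a different rack than the local one first, so the
    BlockManager replicates off-rack before same-rack."""
    def __init__(self, local_rack, rack_of):
        self._local_rack = local_rack
        self._rack_of = rack_of  # executor id -> rack name

    def prioritize(self, peers):
        # False sorts before True, so off-rack peers come first;
        # sort is stable, preserving the original relative order.
        return sorted(peers,
                      key=lambda p: self._rack_of[p] == self._local_rack)

racks = {"e1": "rackA", "e2": "rackB", "e3": "rackA"}
p = RackAwarePrioritizer("rackA", racks)
print(p.prioritize(["e1", "e2", "e3"]))  # ['e2', 'e1', 'e3']
```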
[jira] [Created] (SPARK-15352) Topology aware block replication
Shubham Chopra created SPARK-15352: -- Summary: Topology aware block replication Key: SPARK-15352 URL: https://issues.apache.org/jira/browse/SPARK-15352 Project: Spark Issue Type: New Feature Components: Block Manager, Mesos, Spark Core, YARN Reporter: Shubham Chopra With cached RDDs, Spark can be used for online analytics where it is used to respond to online queries. But loss of RDD partitions due to node/executor failures can cause huge delays in such use cases as the data would have to be regenerated. Cached RDDs, even when using multiple replicas per block, are not currently resilient to node failures when multiple executors are started on the same node. Block replication currently chooses a peer at random, and this peer could also exist on the same host. This effort would add topology aware replication to Spark that can be enabled with pluggable strategies. For ease of development/review, this is being broken down to three major work-efforts: 1. Making peer selection for replication pluggable 2. Providing pluggable implementations for providing topology and topology aware replication 3. Pro-active replenishment of lost blocks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15100) Audit: ml.feature
[ https://issues.apache.org/jira/browse/SPARK-15100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285366#comment-15285366 ] Bryan Cutler commented on SPARK-15100: -- I can do a PR to update CountVectorizer and HashingTF > Audit: ml.feature > - > > Key: SPARK-15100 > URL: https://issues.apache.org/jira/browse/SPARK-15100 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > > Audit this sub-package for new algorithms which do not have corresponding > sections & examples in the user guide. > See parent issue for more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285321#comment-15285321 ] Barry Becker commented on SPARK-15230: -- I updated the description so it says distinct instead of describe. I believe there is a separate jira for the problem with describe not handling backquoted columns. > Back quoted column with dot in it fails when running distinct on dataframe > -- > > Key: SPARK-15230 > URL: https://issues.apache.org/jira/browse/SPARK-15230 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 1.6.0 >Reporter: Barry Becker > > When working with a dataframe, columns with .'s in them must be backquoted > (``) or the column name will not be found. This works for most dataframe > methods, but I discovered that it does not work for distinct(). > Suppose you have a dataFrame, testDf, with a DoubleType column named > {{pos.NoZero}}. This statement: > {noformat} > testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ") > {noformat} > will fail with this error: > {noformat} > org.apache.spark.sql.AnalysisException: Cannot resolve column name > "pos.NoZero" among (pos.NoZero); > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152) > at > org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at > org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329) > at > org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at 
scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329) > at > org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165) > at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328) > at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348) > at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319) > at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612) > at > com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
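The stack trace above shows distinct() delegating to dropDuplicates(), which re-resolves the already-resolved column name without its backticks. A rough, self-contained sketch of that failure mode (a toy resolver with a flat-schema model; the function name and its behavior are illustrative assumptions, not Spark's actual code):

```python
# Toy illustration of the failure mode in the trace above -- NOT Spark's
# actual resolver. The idea: a backquoted name is matched literally, while
# an unquoted name containing a dot is treated as parent.child and so
# fails to match a top-level column in a flat schema.
def resolve(name, schema):
    """Resolve a column name against a flat list of column names."""
    if name.startswith("`") and name.endswith("`"):
        literal = name[1:-1]
        if literal in schema:
            return literal
        raise ValueError(
            f'Cannot resolve column name "{literal}" among ({", ".join(schema)})')
    if "." in name:
        # Interpreted as struct access: no top-level match in a flat schema.
        raise ValueError(
            f'Cannot resolve column name "{name}" among ({", ".join(schema)})')
    return name

schema = ["pos.NoZero"]
# select() resolves the backquoted form and keeps the *bare* name ...
bare = resolve("`pos.NoZero`", schema)
assert bare == "pos.NoZero"
# ... but dropDuplicates() (called by distinct()) re-resolves the bare
# name without re-quoting it, reproducing the AnalysisException message:
try:
    resolve(bare, schema)
except ValueError as e:
    print(e)  # Cannot resolve column name "pos.NoZero" among (pos.NoZero)
```

Under this toy model, the fix is for dropDuplicates to reuse the already-resolved attributes instead of round-tripping through the string name.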
[jira] [Updated] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Barry Becker updated SPARK-15230: - Description: When working with a dataframe, columns with .'s in them must be backquoted (``) or the column name will not be found. This works for most dataframe methods, but I discovered that it does not work for distinct(). Suppose you have a dataFrame, testDf, with a DoubleType column named {{pos.NoZero}}. This statement: {noformat} testDf.select(new Column("`pos.NoZero`")).distinct().collect().mkString(", ") {noformat} will fail with this error: {noformat} org.apache.spark.sql.AnalysisException: Cannot resolve column name "pos.NoZero" among (pos.NoZero); at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152) at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:152) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) at org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329) at org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1$$anonfun$40.apply(DataFrame.scala:1329) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35) at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1329) at org.apache.spark.sql.DataFrame$$anonfun$dropDuplicates$1.apply(DataFrame.scala:1328) at org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$withPlan(DataFrame.scala:2165) at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1328) at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1348) at org.apache.spark.sql.DataFrame.dropDuplicates(DataFrame.scala:1319) at org.apache.spark.sql.DataFrame.distinct(DataFrame.scala:1612) at com.mineset.spark.vizagg.selection.SelectionExpressionSuite$$anonfun$40.apply$mcV$sp(SelectionExpressionSuite.scala:317) {noformat} > Back quoted column with dot in it fails when running distinct on dataframe >
[jira] [Comment Edited] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285302#comment-15285302 ] Bo Meng edited comment on SPARK-15230 at 5/16/16 9:11 PM: -- In the description, {{it does not work for describe()}} should be {{it does not work for distinct()}}, please update the description, thanks. was (Author: bomeng): In the description, `it does not work for describe()` should be `it does not work for distinct()`, please update the description, thanks.
[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285302#comment-15285302 ] Bo Meng commented on SPARK-15230: - In the description, `it does not work for describe()` should be `it does not work for distinct()`, please update the description, thanks.
[jira] [Assigned] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15230: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285263#comment-15285263 ] Apache Spark commented on SPARK-15230: -- User 'bomeng' has created a pull request for this issue: https://github.com/apache/spark/pull/13140
[jira] [Assigned] (SPARK-15230) Back quoted column with dot in it fails when running distinct on dataframe
[ https://issues.apache.org/jira/browse/SPARK-15230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15230: Assignee: Apache Spark
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285248#comment-15285248 ] Bill Zhao commented on SPARK-3785: -- Mesos added GPU support in the 0.29 release: https://issues.apache.org/jira/browse/MESOS-4424 If Spark can consume GPUs as a resource from Mesos, that would expedite GPU computation in Spark. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to add support for off-loading computations to the > GPU, e.g. via an OpenCL binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL
[jira] [Resolved] (SPARK-14942) Reduce delay between batch construction and execution
[ https://issues.apache.org/jira/browse/SPARK-14942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-14942. -- Resolution: Fixed Assignee: Liwei Lin Fix Version/s: 2.0.0 > Reduce delay between batch construction and execution > - > > Key: SPARK-14942 > URL: https://issues.apache.org/jira/browse/SPARK-14942 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Liwei Lin >Assignee: Liwei Lin > Fix For: 2.0.0 > > > Currently in {{StreamExecution}}, we first run the batch, then construct the > next: > {code} > if (dataAvailable) runBatch() > constructNextBatch() > {code} > This is good if we run batches ASAP, where data would get processed in the > very next batch: > !https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png! > However, if we run batches at a trigger like {{ProcessingTime("1 minute")}}, data > - such as y below - may not get processed in the very next batch i.e. batch > 1, but in batch 2: > !https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png!
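The extra-trigger delay described above can be sketched with a toy simulation of the two orderings (a hypothetical simplification for illustration only: `processed_at`, the integer-trigger model, and the arrival encoding are all assumptions, not StreamExecution internals):

```python
# Toy model of the scheduling difference: with "run, then construct",
# a record arriving between triggers is only captured by the batch
# constructed at the *next* trigger, which then runs one trigger later.
def processed_at(arrivals, n_triggers, construct_before_run):
    """Return the trigger index at which the first data gets processed.

    arrivals: trigger indices from which each record is visible.
    construct_before_run: True models constructing the batch at trigger
    time and running it immediately; False models the original order
    (run the previously constructed batch, then construct the next one).
    """
    batch = []  # offsets captured for the next run
    for t in range(n_triggers):
        visible = [a for a in arrivals if a <= t]
        if construct_before_run:
            batch = visible          # constructNextBatch() ...
            if batch:
                return t             # ... then runBatch(): processed now
        else:
            if batch:
                return t             # runBatch() on the stale batch
            batch = visible          # then constructNextBatch()
    return None

# A record that becomes visible at trigger 1:
assert processed_at([1], 10, construct_before_run=True) == 1   # next trigger
assert processed_at([1], 10, construct_before_run=False) == 2  # one trigger late
```

With a long trigger interval such as one minute, that one-trigger difference is a full minute of added latency, which is what this issue fixed.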
[jira] [Commented] (SPARK-15072) Remove SparkSession.withHiveSupport
[ https://issues.apache.org/jira/browse/SPARK-15072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285112#comment-15285112 ] Nicholas Chammas commented on SPARK-15072: -- Brief note from [~yhuai] on the motivation behind this issue: https://github.com/apache/spark/pull/13069#issuecomment-219516577 > Remove SparkSession.withHiveSupport > --- > > Key: SPARK-15072 > URL: https://issues.apache.org/jira/browse/SPARK-15072 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Sandeep Singh > Fix For: 2.0.0 > >
[jira] [Assigned] (SPARK-15186) Add user guide for Generalized Linear Regression.
[ https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15186: Assignee: Seth Hendrickson (was: Apache Spark) > Add user guide for Generalized Linear Regression. > - > > Key: SPARK-15186 > URL: https://issues.apache.org/jira/browse/SPARK-15186 > Project: Spark > Issue Type: New Feature > Components: Documentation, ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson >Priority: Minor > > We should add a user guide for the new GLR interface.
[jira] [Assigned] (SPARK-15186) Add user guide for Generalized Linear Regression.
[ https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15186: Assignee: Apache Spark (was: Seth Hendrickson)
[jira] [Commented] (SPARK-15186) Add user guide for Generalized Linear Regression.
[ https://issues.apache.org/jira/browse/SPARK-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285034#comment-15285034 ] Apache Spark commented on SPARK-15186: -- User 'sethah' has created a pull request for this issue: https://github.com/apache/spark/pull/13139
[jira] [Resolved] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-15343. Resolution: Not A Problem Closing as "not a problem" since this is an issue with 3rd-party code. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting the following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) >
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284981#comment-15284981 ] Marcelo Vanzin commented on SPARK-15343: bq. at org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) You're using a 3rd-party module developed by Hortonworks to talk to the YARN ATS; they include it as part of their distribution, but I believe it's not yet compatible with Spark 2.0. So you need to follow up with them, since this is not an issue with Spark, or disable that feature. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. 
> : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException:
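A commonly reported workaround for this particular NoClassDefFoundError (hedged: it assumes, per the comment above, that the Hortonworks YARN ATS integration is what pulls in the Jersey 1.x client) is to disable the YARN timeline client so YarnClientImpl never tries to create it:

```properties
# Sketch of a workaround, not an official fix: turn off the YARN
# Application Timeline Service client so the com.sun.jersey classes are
# never needed. The property name uses the standard spark.hadoop.*
# pass-through to Hadoop configuration in spark-defaults.conf.
spark.hadoop.yarn.timeline-service.enabled  false
```

Alternatively the Jersey 1.x jars can be added to the driver classpath, but that depends on the vendor distribution.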
[jira] [Assigned] (SPARK-15351) RowEncoder should support array as the external type for ArrayType
[ https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15351: Assignee: Apache Spark (was: Wenchen Fan) > RowEncoder should support array as the external type for ArrayType > -- > > Key: SPARK-15351 > URL: https://issues.apache.org/jira/browse/SPARK-15351 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15351) RowEncoder should support array as the external type for ArrayType
[ https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284913#comment-15284913 ] Apache Spark commented on SPARK-15351: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/13138 > RowEncoder should support array as the external type for ArrayType > -- > > Key: SPARK-15351 > URL: https://issues.apache.org/jira/browse/SPARK-15351 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-15351) RowEncoder should support array as the external type for ArrayType
[ https://issues.apache.org/jira/browse/SPARK-15351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15351: Assignee: Wenchen Fan (was: Apache Spark) > RowEncoder should support array as the external type for ArrayType > -- > > Key: SPARK-15351 > URL: https://issues.apache.org/jira/browse/SPARK-15351 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Commented] (SPARK-15347) Problem select empty ORC table
[ https://issues.apache.org/jira/browse/SPARK-15347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284907#comment-15284907 ] Pedro Prado commented on SPARK-15347: - Sorry Sean! my fault! > Problem select empty ORC table > -- > > Key: SPARK-15347 > URL: https://issues.apache.org/jira/browse/SPARK-15347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: Hadoop 2.7.1.2.4.2.0-258 > Subversion g...@github.com:hortonworks/hadoop.git -r > 13debf893a605e8a88df18a7d8d214f571e05289 > Compiled by jenkins on 2016-04-25T05:46Z > Compiled with protoc 2.5.0 > From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac > This command was run using > /usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar >Reporter: Pedro Prado > > Error when I selected empty ORC table > [pprado@hadoop-m ~]$ beeline -u jdbc:hive2:// > WARNING: Use "yarn jar" to launch YARN applications. > Connecting to jdbc:hive2:// > Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258) > Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive > On beeline => create table my_test (id int, name String) stored as orc; > On beeline => select * from my_test; > 16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri > with key [dfs.encryption.key.provider.uri] to create a keyProvider !! > OK > +-+---+--+ > | my_test.id | my_test.name | > +-+---+--+ > +-+---+--+ > No rows selected (1.227 seconds) > Hive is OK! > Now, when i execute pyspark. > Welcome to > SPARK version 1.6.1 > Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56) > SparkContext available as sc, HiveContext available as sqlContext. 
> PySpark => sqlContext.sql("select * from my_test") > 16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test > 16/05/13 18:33:41 INFO ParseDriver: Parse Completed > Traceback (most recent call last): > File "", line 1, in > File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line > 580, in sql > return DataFrame(self.ssql_ctx.sql(sqlQuery), self) > File > "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call_ > File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path > hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not > have valid orc files matching the pattern' > when i create parquet table, it's all right. I do not have problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
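Until the duplicate issue is fixed, a defensive client-side pattern can avoid the IllegalArgumentException above. The sketch below is hypothetical (the `list_files` and `read_table` callables stand in for a real HDFS client and the actual table read): only issue the read when the table directory contains at least one data file.

```python
# Hypothetical guard for reading a possibly file-less ORC table.
# list_files(path) stands in for an HDFS directory listing;
# read_table(path) stands in for the actual ORC read.
def safe_orc_read(list_files, read_table, path):
    # skip metadata entries such as _SUCCESS; keep only data files
    files = [f for f in list_files(path) if not f.startswith("_")]
    if not files:
        return []              # treat a file-less table as empty
    return read_table(path)

# stubbed listings standing in for the HDFS client
assert safe_orc_read(lambda p: [], lambda p: ["row"], "/apps/hive/warehouse/my_test") == []
assert safe_orc_read(lambda p: ["part-0"], lambda p: ["row"], "/apps/hive/warehouse/my_test") == ["row"]
```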
[jira] [Created] (SPARK-15351) RowEncoder should support array as the external type for ArrayType
Wenchen Fan created SPARK-15351: --- Summary: RowEncoder should support array as the external type for ArrayType Key: SPARK-15351 URL: https://issues.apache.org/jira/browse/SPARK-15351 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Commented] (SPARK-15272) DirectKafkaInputDStream doesn't work with window operation
[ https://issues.apache.org/jira/browse/SPARK-15272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284776#comment-15284776 ] Lubomir Nerad commented on SPARK-15272: --- We can workaround the Kafka part of the issue. But what about the delay scheduling algorithm? Can't the same problem arise if for example some host dies after a TaskSet has been constructed with tasks having it in their preferred locations? > DirectKafkaInputDStream doesn't work with window operation > -- > > Key: SPARK-15272 > URL: https://issues.apache.org/jira/browse/SPARK-15272 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.2 >Reporter: Lubomir Nerad > > Using Kafka direct {{DStream}} with simple window operation like: > {code:java} > kafkaDStream.window(Durations.milliseconds(1), > Durations.milliseconds(1000)); > .print(); > {code} > with 1s batch duration either freezes after several seconds or lags terribly > (depending on cluster mode). > This happens when Kafka brokers are not part of the Spark cluster (they are > on different nodes). The {{KafkaRDD}} still reports them as preferred > locations. This doesn't seem to be problem in non-window scenarios but with > window it conflicts with delay scheduling algorithm implemented in > {{TaskSetManager}}. It either significantly delays (Yarn mode) or completely > drains (Spark mode) resource offers with {{TaskLocality.ANY}} which are > needed to process tasks with these Kafka broker aligned preferred locations. > When delay scheduling algorithm is switched off ({{spark.locality.wait=0}}), > the example works correctly. > I think that the {{KafkaRDD}} shouldn't report preferred locations if the > brokers don't correspond to worker nodes or allow the reporting of preferred > locations to be switched off. Also it would be good if delay scheduling > algorithm didn't drain / delay offers in the case, the tasks have unmatched > preferred locations. 
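The delay-scheduling interaction described in SPARK-15272 above can be sketched without Spark. This toy model is an illustration of the behaviour attributed to TaskSetManager, not its actual code: a task whose preferred hosts are outside the cluster never receives a node-local offer, so it only launches once the locality wait has elapsed and the scheduler may fall back to ANY.

```python
# Toy model of delay scheduling (illustrative; not Spark's TaskSetManager).
def can_launch(preferred_hosts, offer_host, elapsed, locality_wait):
    if offer_host in preferred_hosts:
        return True            # node-local offer: launch immediately
    # non-local offers are refused until the locality wait expires
    return elapsed >= locality_wait

# Kafka brokers are outside the cluster, so offers from worker nodes are
# refused for the full wait (3 s by default), delaying every batch.
assert not can_launch({"broker1"}, "worker1", elapsed=1.0, locality_wait=3.0)
assert can_launch({"broker1"}, "worker1", elapsed=3.0, locality_wait=3.0)
# With spark.locality.wait=0 the fallback is immediate, matching the
# reporter's observation that the example then works correctly.
assert can_launch({"broker1"}, "worker1", elapsed=0.0, locality_wait=0.0)
```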
[jira] [Assigned] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15247: Assignee: (was: Apache Spark) > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > though this is only 1 very small file > This issue can increase the latency for small jobs.
[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284771#comment-15284771 ] Apache Spark commented on SPARK-15247: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/13137 > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > though this is only 1 very small file > This issue can increase the latency for small jobs.
[jira] [Assigned] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15247: Assignee: Apache Spark > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. >Assignee: Apache Spark > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > though this is only 1 very small file > This issue can increase the latency for small jobs.
[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284763#comment-15284763 ] Takeshi Yamamuro commented on SPARK-15247: -- I'll make a pr to fix this. > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > though this is only 1 very small file > This issue can increase the latency for small jobs.
[jira] [Comment Edited] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284486#comment-15284486 ] Takeshi Yamamuro edited comment on SPARK-15247 at 5/16/16 3:56 PM: --- Not yet. Actually, I'm not 100% sure that this issue needs to be fixed. was (Author: maropu): Not yet. Actually, I'm not sure that this issue needs to be fixed. > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > though this is only 1 very small file > This issue can increase the latency for small jobs.
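For readers wondering where the n_executors * n_cores floor in SPARK-15247 comes from: Spark's default parallelism on a coarse-grained cluster is, roughly, the total number of executor cores (with a floor of 2), and the reported symptom is that the 1.6-era parquet reader sizes its task count from that figure rather than from the file's actual splits. A back-of-the-envelope sketch, modelling only the reported behaviour and not Spark's verified internals:

```python
# Illustrative only: models the report that even a single tiny parquet
# file is read with defaultParallelism tasks. The "total cores, floor 2"
# rule matches the coarse-grained scheduler's default; the mapping to
# read tasks is the reported symptom, not verified partitioning code.
def default_parallelism(n_executors, cores_per_executor):
    return max(n_executors * cores_per_executor, 2)

def reported_read_tasks(n_executors, cores_per_executor, n_file_splits=1):
    # the complaint: task count ignores how small the input actually is
    return max(default_parallelism(n_executors, cores_per_executor), n_file_splits)

assert reported_read_tasks(8, 4) == 32   # 8 executors * 4 cores, one tiny file
assert reported_read_tasks(0, 0) == 2    # floor of 2
```

This is why the overhead shows up as extra latency only on small jobs: for large inputs, the split count dominates anyway.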
[jira] [Updated] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-15350: --- Priority: Minor (was: Major) > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite > - > > Key: SPARK-15350 > URL: https://issues.apache.org/jira/browse/SPARK-15350 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Weichen Xu >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite.
[jira] [Assigned] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15350: Assignee: Apache Spark > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite > - > > Key: SPARK-15350 > URL: https://issues.apache.org/jira/browse/SPARK-15350 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite.
[jira] [Commented] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284718#comment-15284718 ] Apache Spark commented on SPARK-15350: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/13136 > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite > - > > Key: SPARK-15350 > URL: https://issues.apache.org/jira/browse/SPARK-15350 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite.
[jira] [Assigned] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
[ https://issues.apache.org/jira/browse/SPARK-15350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15350: Assignee: (was: Apache Spark) > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite > - > > Key: SPARK-15350 > URL: https://issues.apache.org/jira/browse/SPARK-15350 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Add unit test function for LogisticRegressionWithLBFGS in > JavaLogisticRegressionSuite.
[jira] [Created] (SPARK-15350) Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite
Weichen Xu created SPARK-15350: -- Summary: Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite Key: SPARK-15350 URL: https://issues.apache.org/jira/browse/SPARK-15350 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.0 Reporter: Weichen Xu Add unit test function for LogisticRegressionWithLBFGS in JavaLogisticRegressionSuite.
[jira] [Comment Edited] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284698#comment-15284698 ] Ran Haim edited comment on SPARK-15348 at 5/16/16 3:09 PM: --- This means that if I have a transactional table in hive, I cannot use a spark job to update it or even read it in a coherent way. was (Author: ran.h...@optimalplus.com): If I have a transactional table in hive, I cannot use a spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284698#comment-15284698 ] Ran Haim commented on SPARK-15348: -- If I have a transactional table in hive, I cannot use a spark job to update it or even read it in a coherent way. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Closed] (SPARK-15349) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ran Haim closed SPARK-15349. Resolution: Duplicate > Hive ACID > - > > Key: SPARK-15349 > URL: https://issues.apache.org/jira/browse/SPARK-15349 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Resolved] (SPARK-15347) Problem select empty ORC table
[ https://issues.apache.org/jira/browse/SPARK-15347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-15347. --- Resolution: Duplicate Fix Version/s: (was: 1.6.0) Please have a look through JIRA first and read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > Problem select empty ORC table > -- > > Key: SPARK-15347 > URL: https://issues.apache.org/jira/browse/SPARK-15347 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.6.1 > Environment: Hadoop 2.7.1.2.4.2.0-258 > Subversion g...@github.com:hortonworks/hadoop.git -r > 13debf893a605e8a88df18a7d8d214f571e05289 > Compiled by jenkins on 2016-04-25T05:46Z > Compiled with protoc 2.5.0 > From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac > This command was run using > /usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar >Reporter: Pedro Prado > > Error when I selected empty ORC table > [pprado@hadoop-m ~]$ beeline -u jdbc:hive2:// > WARNING: Use "yarn jar" to launch YARN applications. > Connecting to jdbc:hive2:// > Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258) > Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258) > Transaction isolation: TRANSACTION_REPEATABLE_READ > Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive > On beeline => create table my_test (id int, name String) stored as orc; > On beeline => select * from my_test; > 16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri > with key [dfs.encryption.key.provider.uri] to create a keyProvider !! > OK > +-+---+--+ > | my_test.id | my_test.name | > +-+---+--+ > +-+---+--+ > No rows selected (1.227 seconds) > Hive is OK! > Now, when i execute pyspark. > Welcome to > SPARK version 1.6.1 > Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56) > SparkContext available as sc, HiveContext available as sqlContext. 
> PySpark => sqlContext.sql("select * from my_test") > 16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test > 16/05/13 18:33:41 INFO ParseDriver: Parse Completed > Traceback (most recent call last): > File "", line 1, in > File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line > 580, in sql > return DataFrame(self.ssql_ctx.sql(sqlQuery), self) > File > "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", > line 813, in __call_ > File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path > hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not > have valid orc files matching the pattern' > when i create parquet table, it's all right. I do not have problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284688#comment-15284688 ] Sean Owen commented on SPARK-15348: --- I suspect that's waay outside the goals of the project and a huge piece of work > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature >Reporter: Ran Haim > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major'
[jira] [Created] (SPARK-15349) Hive ACID
Ran Haim created SPARK-15349: Summary: Hive ACID Key: SPARK-15349 URL: https://issues.apache.org/jira/browse/SPARK-15349 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of hive's transactional tables, you cannot use spark to delete/update a table and it also has problems reading the aggregated data when no compaction was done. Also it seems that compaction is not supported - alter table ... partition COMPACT 'major'
[jira] [Created] (SPARK-15348) Hive ACID
Ran Haim created SPARK-15348: Summary: Hive ACID Key: SPARK-15348 URL: https://issues.apache.org/jira/browse/SPARK-15348 Project: Spark Issue Type: New Feature Reporter: Ran Haim Spark does not support any feature of hive's transactional tables, you cannot use spark to delete/update a table and it also has problems reading the aggregated data when no compaction was done. Also it seems that compaction is not supported - alter table ... partition COMPACT 'major'
[jira] [Created] (SPARK-15347) Problem select empty ORC table
Pedro Prado created SPARK-15347: --- Summary: Problem select empty ORC table Key: SPARK-15347 URL: https://issues.apache.org/jira/browse/SPARK-15347 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.1 Environment: Hadoop 2.7.1.2.4.2.0-258 Subversion g...@github.com:hortonworks/hadoop.git -r 13debf893a605e8a88df18a7d8d214f571e05289 Compiled by jenkins on 2016-04-25T05:46Z Compiled with protoc 2.5.0 >From source with checksum 2a2d95f05ec6c3ac547ed58cab713ac This command was run using /usr/hdp/2.4.2.0-258/hadoop/hadoop-common-2.7.1.2.4.2.0-258.jar Reporter: Pedro Prado Fix For: 1.6.0 Error when I selected empty ORC table [pprado@hadoop-m ~]$ beeline -u jdbc:hive2:// WARNING: Use "yarn jar" to launch YARN applications. Connecting to jdbc:hive2:// Connected to: Apache Hive (version 1.2.1000.2.4.2.0-258) Driver: Hive JDBC (version 1.2.1000.2.4.2.0-258) Transaction isolation: TRANSACTION_REPEATABLE_READ Beeline version 1.2.1000.2.4.2.0-258 by Apache Hive On beeline => create table my_test (id int, name String) stored as orc; On beeline => select * from my_test; 16/05/13 18:18:57 [main]: ERROR hdfs.KeyProviderCache: Could not find uri with key [dfs.encryption.key.provider.uri] to create a keyProvider !! OK +-+---+--+ | my_test.id | my_test.name | +-+---+--+ +-+---+--+ No rows selected (1.227 seconds) Hive is OK! Now, when i execute pyspark. Welcome to SPARK version 1.6.1 Using Python version 2.6.6 (r266:84292, Jul 23 2015 15:22:56) SparkContext available as sc, HiveContext available as sqlContext. 
PySpark => sqlContext.sql("select * from my_test") 16/05/13 18:33:41 INFO ParseDriver: Parsing command: select * from my_test 16/05/13 18:33:41 INFO ParseDriver: Parse Completed Traceback (most recent call last): File "", line 1, in File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/context.py", line 580, in sql return DataFrame(self.ssql_ctx.sql(sqlQuery), self) File "/usr/hdp/2.4.2.0-258/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call_ File "/usr/hdp/2.4.2.0-258/spark/python/pyspark/sql/utils.py", line 53, in deco raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.IllegalArgumentException: u'orcFileOperator: path hdfs://hadoop-m.c.sva-0001.internal:8020/apps/hive/warehouse/my_test does not have valid orc files matching the pattern' when i create parquet table, it's all right. I do not have problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Abraham Zhan updated SPARK-15346: - Description: h2.Main Issue I found that for KMeans|| in mllib, when the dataset is large scale, after the initial KMeans|| finishes and before Lloyd's iteration begins, the program gets stuck for a long time without terminating. After testing I see it is stuck in LocalKMeans, and there is a feature to be improved in LocalKMeans.scala in MLlib. After picking each new initial center, it's unnecessary to compute the distances between all the points and the old centers as below {code:scala} val costArray = points.map { point => KMeans.fastSquaredDistance(point, centers(0)) } {code} Instead, we can keep the distance between each point and its closest center, compare it with the distance to the new center, and update it. h2.Test Download [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip] I provided an attachment "LocalKMeans.zip" which contains the code "LocalKMeans.scala" and the dataset "bigKMeansMedia". LocalKMeans.scala contains both the original method KMeansPlusPlus and a modified version KMeansPlusPlusModify. (best fit with spark.mllib-1.6.0) I added tests and a main function to it so that anyone can run the file directly. h3.How to Test Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans.scala. Modify the path in line 34 (loadAndRun()) with the path where you stored the data file bigKMeansMedia, which is also provided in the patch. Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the clustering number K and the iteration number respectively. Then the console will print the cost time and SE of the two versions of KMeans++ respectively. h2.Test Results This data is generated from a KMeans|| experiment in Spark; I added some inner functions and output the result of the KMeans|| initialization.
The first line of the file, with format "%d:%d:%d:%d", indicates "the seed:feature num:iteration num (in the original KMeans||):points num" of the data. On my machine the experiment result is as below: !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg! The x-axis is the clustering number k while the y-axis is the time in seconds. was: h2.Main Issue I found the actual reason why the GUI does not finish: it turns out to be stuck in LocalKMeans. > Reduce duplicate computation in picking initial points in LocalKMeans > - > > Key: SPARK-15346 > URL: https://issues.apache.org/jira/browse/SPARK-15346 > Project: Spark > Issue Type: Improvement > Components: MLlib > Environment: Ubuntu 14.04 >Reporter: Abraham Zhan > Labels: performance
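The proposed change can be sketched as follows. This is a minimal standalone sketch with hypothetical names, not the actual LocalKMeans code (the real code uses KMeans.fastSquaredDistance on MLlib vectors): each point's squared distance to its closest chosen center is cached, and after picking a new center only that one center is compared against, instead of recomputing distances to all old centers.

```scala
// Sketch of KMeans++ seeding with cached closest-center costs.
object KMeansPlusPlusSketch {
  def squaredDistance(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
    sum
  }

  def pickCenters(points: Array[Array[Double]], k: Int, seed: Long): Array[Array[Double]] = {
    val rand = new scala.util.Random(seed)
    val centers = new Array[Array[Double]](k)
    centers(0) = points(rand.nextInt(points.length))
    // costArray(i) = squared distance from points(i) to its closest center so far
    val costArray = points.map(squaredDistance(_, centers(0)))
    for (c <- 1 until k) {
      // D^2 sampling: pick the next center with probability proportional to cost
      var r = rand.nextDouble() * costArray.sum
      var j = 0
      while (j < points.length - 1 && r > costArray(j)) { r -= costArray(j); j += 1 }
      centers(c) = points(j)
      // The saving: one pass comparing against the new center only,
      // rather than all points against all previously chosen centers.
      var p = 0
      while (p < points.length) {
        costArray(p) = math.min(costArray(p), squaredDistance(points(p), centers(c)))
        p += 1
      }
    }
    centers
  }
}
```

With this update rule, each seeding round costs one distance pass over the points, so the k rounds cost O(nk) distances instead of O(nk^2).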
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284623#comment-15284623 ] Sean Owen commented on SPARK-15343: --- Since you're executing in a cluster, I think perhaps a better and more canonical solution is to build with "-Phadoop-provided" and get the Hadoop dependencies from the cluster; then you're inheriting the version that's consistent with the cluster config. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at
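Sean's suggestion maps onto the reporter's original build command roughly as below. This is a sketch, not a verified build: the profile combination is assumed from the comment, and the cluster's Hadoop jars would then need to be supplied on the classpath at launch (e.g. via SPARK_DIST_CLASSPATH).

```shell
# Build a "hadoop-provided" distribution: Hadoop (and its Jersey 1.x
# dependencies) come from the cluster at runtime instead of the assembly.
./dev/make-distribution.sh -Pyarn -Phadoop-provided -Phive -Phive-thriftserver \
  -Dhadoop.version=2.6.0 -DskipTests
```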
[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284619#comment-15284619 ] Maciej Bryński edited comment on SPARK-15343 at 5/16/16 2:05 PM: - Thanks. I set spark.hadoop.yarn.timeline-service.enabled to false. It's a nasty workaround but it works. was (Author: maver1ck): I set spark.hadoop.yarn.timeline-service.enabled to false. It's a nasty workaround but it works. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
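The workaround above can be expressed as a submit-time setting. This is a config sketch: the property name comes from the comment (Spark's spark.hadoop.* prefix passes it through to the Hadoop configuration), while the application script name is a placeholder.

```shell
# Disable the YARN timeline client so YarnClientImpl never tries to
# instantiate the Jersey 1.x TimelineClient missing from the 2.0 assembly.
bin/spark-submit \
  --master yarn \
  --conf spark.hadoop.yarn.timeline-service.enabled=false \
  your_app.py
```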
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284619#comment-15284619 ] Maciej Bryński commented on SPARK-15343: I set spark.hadoop.yarn.timeline-service.enabled to false. It's a nasty workaround but it works. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Issue Comment Deleted] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Boesch updated SPARK-4924: -- Comment: was deleted (was: Chiming in here as well: three of us are now asking for commentary / pointers to the following: * What capabilities have been added to the spark api * How do we use them * Any examples / other relevant documentation and/or code Just saying "read the documentation" is not acceptable guidance for how to use these added features.) > Factor out code to launch Spark applications into a separate library > > > Key: SPARK-4924 > URL: https://issues.apache.org/jira/browse/SPARK-4924 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.4.0 > > Attachments: spark-launcher.txt > > > One of the questions we run into rather commonly is "how to start a Spark > application from my Java/Scala program?". There currently isn't a good answer > to that: > - Instantiating SparkContext has limitations (e.g., you can only have one > active context at the moment, plus you lose the ability to submit apps in > cluster mode) > - Calling SparkSubmit directly is doable but you lose a lot of the logic > handled by the shell scripts > - Calling the shell script directly is doable, but sort of ugly from an API > point of view. > I think it would be nice to have a small library that handles that for users. > On top of that, this library could be used by Spark itself to replace a lot > of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284617#comment-15284617 ] Sean Owen commented on SPARK-12154: --- No, I don't think so - let's keep the discussion in one place on the other JIRA > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Blocker > Fix For: 2.0.0 > > > Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to workaround - see discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a break for users who were using Jersey 1 in > their Spark jobs.
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284612#comment-15284612 ] Sean Owen commented on SPARK-15343: --- Yes, of course that's the change that caused the behavior you're seeing, but it should be OK for all of Spark's usages. At least, that was the conclusion before, and all of the Spark tests work. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284538#comment-15284538 ] Sean Owen commented on SPARK-15343: --- No, it's clearly a class needed by YARN and that's where it fails -- have a look at the stack. Yes, YARN certainly is the one using Jersey 1.x and it is in a different namespace. When this came up before I was wondering if we needed to adjust exclusions to allow both into the assembly, but have a look at this: http://apache-spark-developers-list.1001551.n3.nabble.com/spark-2-0-issue-with-yarn-td17440.html#a17448 I think the conclusion was that the thing that needs Jersey isn't a part of Spark? > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284509#comment-15284509 ] Maciej Bryński commented on SPARK-12154: I think this upgrade breaks compatibility with YARN. https://issues.apache.org/jira/browse/SPARK-15343 > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah >Assignee: Matt Cheah >Priority: Blocker > Fix For: 2.0.0
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284507#comment-15284507 ] Maciej Bryński commented on SPARK-15343: And the likely reason for the problem: https://issues.apache.org/jira/browse/SPARK-12154 > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Comment Edited] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284506#comment-15284506 ] Maciej Bryński edited comment on SPARK-15343 at 5/16/16 1:52 PM: - I think it's too early for that. The exception is thrown during JavaSparkContext initialization, so before any connection to YARN is made. I checked jersey-client-1.19.1.jar and com/sun/jersey/api/client/config/ClientConfig is inside it. Maybe we should include both versions of this library? was (Author: maver1ck): I think it's too early for that. Exception is thrown on JavaSparkContext initialization. So before connection to YARN. I checked jersey-client-1.19.1.jar and com/sun/jersey/api/client/config/ClientConfig is inside. Maybe we should include both versions ? > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
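The jar check described in the comment above (verifying that com/sun/jersey/api/client/config/ClientConfig really is inside jersey-client-1.19.1.jar) can be reproduced with a few lines of JVM code. This is only an illustrative sketch, not Spark code; the JarCheck class and containsEntry method are made-up names:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of Spark): checks whether an entry is
// present inside a jar, the same check done by hand in the comment above.
class JarCheck {
    static boolean containsEntry(Path jar, String entry) {
        // A jar is just a zip; class files are stored as path-like entries,
        // e.g. "com/sun/jersey/api/client/config/ClientConfig.class".
        try (ZipFile zf = new ZipFile(jar.toFile())) {
            return zf.getEntry(entry) != null;
        } catch (IOException e) {
            return false; // unreadable jar: treat as "not found"
        }
    }
}
```

Note that a class being present in some jar on disk is not enough: the NoClassDefFoundError above means the jar carrying that entry was not on the driver's runtime classpath.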
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284506#comment-15284506 ] Maciej Bryński commented on SPARK-15343: I think it's too early for that. The exception is thrown during JavaSparkContext initialization, so before any connection to YARN is made. I checked jersey-client-1.19.1.jar and com/sun/jersey/api/client/config/ClientConfig is inside it. Maybe we should include both versions? > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Assigned] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15346: Assignee: Apache Spark > Reduce duplicate computation in picking initial points in LocalKMeans > - > > Key: SPARK-15346 > URL: https://issues.apache.org/jira/browse/SPARK-15346 > Project: Spark > Issue Type: Improvement > Components: MLlib > Environment: Ubuntu 14.04 >Reporter: Abraham Zhan >Assignee: Apache Spark > Labels: performance > > h2.Main Issue > I found the actual reason why the GUI does not finish: it is stuck in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that can be improved. After picking each new initial center, it is unnecessary to recompute the distances between all the points and the old centers, as below > {code:scala} > val costArray = points.map { point => > KMeans.fastSquaredDistance(point, centers(0)) > } > {code} > Instead, we can keep the distance between each point and its closest center, compare it with the distance to the new center, and update it if the new center is closer. > h2.Test > Download [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip] > I provided an attachment "LocalKMeans.zip" which contains the code "LocalKMeans.scala" and the dataset "bigKMeansMedia". > LocalKMeans.scala contains both the original method KMeansPlusPlus and a modified version KMeansPlusPlusModify (best fit with spark.mllib-1.6.0). > I added tests and a main function so that anyone can run the file directly. > h3.How to Test > Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans.scala. > Modify the path in line 34 (loadAndRun()) to the path where you stored the data file bigKMeansMedia, which is also provided in the patch. > Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the clustering number K and the iteration number respectively. > Then the console will print the cost time and SE of the two versions of KMeans++ respectively. > h2.Test Results > This data was generated from a KMeans|| experiment in Spark; I added some inner functions, output the result of the KMeans|| initialization, and stored it. > The first line of the file, with format "%d:%d:%d:%d", indicates "the seed:feature num:iteration num (in original KMeans||):points num" of the data. > On my machine the experiment result is as below: > !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg! > the x-axis is the clustering number k while the y-axis is the time in seconds -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
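The optimization described above can be sketched outside of Spark in a few lines. The names below (squaredDistance, updateCosts) are illustrative, not MLlib's actual API: instead of recomputing every point's distance to all chosen centers after each new center is picked, keep one cost per point (the squared distance to its closest center so far) and fold in the new center with a min.

```java
// Illustrative sketch of the proposed k-means++ cost update (names are
// made up, not the MLlib API). costs[i] holds the squared distance from
// point i to its closest center chosen so far; when a new center is
// picked, each entry only needs to be compared against that one center.
class KMeansCostUpdate {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // O(n) work per new center, instead of recomputing against all k
    // previously chosen centers (O(n * k)).
    static double[] updateCosts(double[][] points, double[] costs, double[] newCenter) {
        double[] updated = new double[costs.length];
        for (int i = 0; i < points.length; i++) {
            updated[i] = Math.min(costs[i], squaredDistance(points[i], newCenter));
        }
        return updated;
    }
}
```

The per-point minimum is exactly the quantity k-means++ samples against, so the incremental update changes the cost of initialization without changing its result.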
[jira] [Commented] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284498#comment-15284498 ] Apache Spark commented on SPARK-15346: -- User 'mouendless' has created a pull request for this issue: https://github.com/apache/spark/pull/13133 > Reduce duplicate computation in picking initial points in LocalKMeans > - > > Key: SPARK-15346 > URL: https://issues.apache.org/jira/browse/SPARK-15346 > Project: Spark > Issue Type: Improvement > Components: MLlib > Environment: Ubuntu 14.04 >Reporter: Abraham Zhan > Labels: performance
[jira] [Assigned] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
[ https://issues.apache.org/jira/browse/SPARK-15346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15346: Assignee: (was: Apache Spark) > Reduce duplicate computation in picking initial points in LocalKMeans > - > > Key: SPARK-15346 > URL: https://issues.apache.org/jira/browse/SPARK-15346 > Project: Spark > Issue Type: Improvement > Components: MLlib > Environment: Ubuntu 14.04 >Reporter: Abraham Zhan > Labels: performance
[jira] [Created] (SPARK-15346) Reduce duplicate computation in picking initial points in LocalKMeans
Abraham Zhan created SPARK-15346: Summary: Reduce duplicate computation in picking initial points in LocalKMeans Key: SPARK-15346 URL: https://issues.apache.org/jira/browse/SPARK-15346 Project: Spark Issue Type: Improvement Components: MLlib Environment: Ubuntu 14.04 Reporter: Abraham Zhan h2.Main Issue I found the actual reason why the GUI does not finish: it is stuck in LocalKMeans. There is a feature in LocalKMeans.scala in MLlib that can be improved. After picking each new initial center, it is unnecessary to recompute the distances between all the points and the old centers, as below {code:scala} val costArray = points.map { point => KMeans.fastSquaredDistance(point, centers(0)) } {code} Instead, we can keep the distance between each point and its closest center, compare it with the distance to the new center, and update it if the new center is closer. h2.Test Download [LocalKMeans.zip|https://dl.dropboxusercontent.com/u/83207617/LocalKMeans.zip] I provided an attachment "LocalKMeans.zip" which contains the code "LocalKMeans.scala" and the dataset "bigKMeansMedia". LocalKMeans.scala contains both the original method KMeansPlusPlus and a modified version KMeansPlusPlusModify (best fit with spark.mllib-1.6.0). I added tests and a main function so that anyone can run the file directly. h3.How to Test Replace mllib.clustering.LocalKMeans.scala in your local repository with my LocalKMeans.scala. Modify the path in line 34 (loadAndRun()) to the path where you stored the data file bigKMeansMedia, which is also provided in the patch. Tune the 2nd and 3rd parameters in line 34 (loadAndRun()), which refer to the clustering number K and the iteration number respectively. Then the console will print the cost time and SE of the two versions of KMeans++ respectively. h2.Test Results This data was generated from a KMeans|| experiment in Spark; I added some inner functions, output the result of the KMeans|| initialization, and stored it. The first line of the file, with format "%d:%d:%d:%d", indicates "the seed:feature num:iteration num (in original KMeans||):points num" of the data. On my machine the experiment result is as below: !https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg! the x-axis is the clustering number k while the y-axis is the time in seconds -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284486#comment-15284486 ] Takeshi Yamamuro commented on SPARK-15247: -- Not yet. Actually, I'm not sure that this issue needs to be fixed. > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > when the input is only one very small file. > This issue can increase the latency of small jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4924) Factor out code to launch Spark applications into a separate library
[ https://issues.apache.org/jira/browse/SPARK-4924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284483#comment-15284483 ] Thomas Graves commented on SPARK-4924: -- [~javadba] If you have ideas on improving the documentation, please file a JIRA and point them out or make suggestions. The Java API is mentioned in the programming guide: http://spark.apache.org/docs/1.6.0/programming-guide.html#launching-spark-jobs-from-java--scala > Factor out code to launch Spark applications into a separate library > > > Key: SPARK-4924 > URL: https://issues.apache.org/jira/browse/SPARK-4924 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.4.0 > > Attachments: spark-launcher.txt > > > One of the questions we run into rather commonly is "how to start a Spark > application from my Java/Scala program?". There currently isn't a good answer > to that: > - Instantiating SparkContext has limitations (e.g., you can only have one > active context at the moment, plus you lose the ability to submit apps in > cluster mode) > - Calling SparkSubmit directly is doable but you lose a lot of the logic > handled by the shell scripts > - Calling the shell script directly is doable, but sort of ugly from an API > point of view. > I think it would be nice to have a small library that handles that for users. > On top of that, this library could be used by Spark itself to replace a lot > of the code in the current shell scripts, which have a lot of duplication. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284480#comment-15284480 ] Sean Owen commented on SPARK-15343: --- Yeah, though in theory that doesn't prevent it from being pulled in by YARN from its own copy. You should have YARN 'provided' at runtime by the cluster, not bundled in your app, right? > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284465#comment-15284465 ] Maciej Bryński commented on SPARK-15343: [~srowen] I found that we changed the version of the Jersey library from 1.9 (https://github.com/apache/spark/blob/branch-1.6/pom.xml#L182) to 2.22.2 (https://github.com/apache/spark/blob/master/pom.xml#L175). Maybe that's the reason. > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at
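For anyone hitting this, the missing class lives in jersey 1.x (com.sun.jersey), so one quick diagnostic is to scan the jars shipped with the Spark build for it. A minimal sketch (the jar directory path is hypothetical; point it at your distribution's jars/lib directory):

```python
import os
import zipfile

# The class YARN's TimelineClient needs, as a jar entry path.
CLASS_ENTRY = "com/sun/jersey/api/client/config/ClientConfig.class"

def jars_containing(class_entry, jar_dir):
    """Return the names of jars under jar_dir that contain the given class entry."""
    hits = []
    for name in sorted(os.listdir(jar_dir)):
        if not name.endswith(".jar"):
            continue
        with zipfile.ZipFile(os.path.join(jar_dir, name)) as jar:
            if class_entry in jar.namelist():
                hits.append(name)
    return hits

# Hypothetical usage:
# jars_containing(CLASS_ENTRY, "/home/mbrynski/spark/jars")
```

If the list comes back empty, no jersey 1.x client jar is on the classpath, which would be consistent with the version bump noted above.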
[jira] [Commented] (SPARK-14881) pyspark and sparkR shell default log level should match spark-shell/Scala
[ https://issues.apache.org/jira/browse/SPARK-14881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284454#comment-15284454 ] Maciej Bryński commented on SPARK-14881: [~felixcheung] Could you check this ? https://issues.apache.org/jira/browse/SPARK-15344 > pyspark and sparkR shell default log level should match spark-shell/Scala > - > > Key: SPARK-14881 > URL: https://issues.apache.org/jira/browse/SPARK-14881 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Shell, SparkR >Affects Versions: 2.0.0 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Scala spark-shell defaults to log level WARN. pyspark and sparkR should match > that by default (user can change it later) > # ./bin/spark-shell > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284449#comment-15284449 ] Sean Owen commented on SPARK-15344: --- I know, but I'm suggesting it's probably more useful to continue or reopen the original issue if it didn't work. > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284442#comment-15284442 ] Maciej Bryński edited comment on SPARK-15344 at 5/16/16 12:44 PM: -- Yep. I mentioned PR from this Jira in description. was (Author: maver1ck): Yep. I mention PR from this Jira in description. > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284442#comment-15284442 ] Maciej Bryński commented on SPARK-15344: Yep. I mention PR from this Jira in description. > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15344) Unable to set default log level for PySpark
[ https://issues.apache.org/jira/browse/SPARK-15344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284438#comment-15284438 ] Sean Owen commented on SPARK-15344: --- Comment on SPARK-14881 then, maybe? This sounds like a duplicate or at least closely related. CC [~felixcheung] > Unable to set default log level for PySpark > --- > > Key: SPARK-15344 > URL: https://issues.apache.org/jira/browse/SPARK-15344 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > After this patch: > https://github.com/apache/spark/pull/12648 > I'm unable to set default log level for Pyspark. > It's always WARN. > Below setting doesn't work: > {code} > mbrynski@jupyter:~/spark$ cat conf/log4j.properties > # Set everything to be logged to the console > log4j.rootCategory=INFO, console > log4j.appender.console=org.apache.log4j.ConsoleAppender > log4j.appender.console.target=System.err > log4j.appender.console.layout=org.apache.log4j.PatternLayout > log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p > %c{1}: %m%n > # Set the default spark-shell log level to WARN. When running the > spark-shell, the > # log level for this class is used to overwrite the root logger's log level, > so that > # the user can have different defaults for the shell and regular Spark apps. 
> log4j.logger.org.apache.spark.repl.Main=INFO > # Settings to quiet third party logs that are too verbose > log4j.logger.org.spark_project.jetty=WARN > log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR > log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO > log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO > log4j.logger.org.apache.parquet=ERROR > log4j.logger.parquet=ERROR > # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent > UDFs in SparkSQL with Hive support > log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL > log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15345) Cannot connect to Hive databases
Piotr Milanowski created SPARK-15345: Summary: Cannot connect to Hive databases Key: SPARK-15345 URL: https://issues.apache.org/jira/browse/SPARK-15345 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Piotr Milanowski I am working with branch-2.0; Spark is compiled with Hive support (-Phive and -Phive-thriftserver). I am trying to access databases using this snippet: {code} from pyspark.sql import HiveContext hc = HiveContext(sc) hc.sql("show databases").collect() [Row(result='default')] {code} This means that Spark doesn't find any databases specified in the configuration. Using the same configuration (i.e. hive-site.xml and core-site.xml) in Spark 1.6 and launching the above snippet, I can print out the existing databases. When run in DEBUG mode, this is what Spark (2.0) prints out: {code} 16/05/16 12:17:47 INFO SparkSqlParser: Parsing command: show databases 16/05/16 12:17:47 DEBUG SimpleAnalyzer: === Result of Batch Resolution === !'Project [unresolveddeserializer(createexternalrow(if (isnull(input[0, string])) null else input[0, string].toString, StructField(result,StringType,false)), result#2) AS #3] Project [createexternalrow(if (isnull(result#2)) null else result#2.toString, StructField(result,StringType,false)) AS #3] +- LocalRelation [result#2] +- LocalRelation [result#2] 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.sql.Dataset$$anonfun$53) +++ 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long org.apache.spark.sql.Dataset$$anonfun$53.serialVersionUID 16/05/16 12:17:47 DEBUG ClosureCleaner: private final org.apache.spark.sql.types.StructType org.apache.spark.sql.Dataset$$anonfun$53.structType$1 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object org.apache.spark.sql.Dataset$$anonfun$53.apply(java.lang.Object) 16/05/16 12:17:47 DEBUG 
ClosureCleaner: public final java.lang.Object org.apache.spark.sql.Dataset$$anonfun$53.apply(org.apache.spark.sql.catalyst.InternalRow) 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because this is the starting closure 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting closure: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure (org.apache.spark.sql.Dataset$$anonfun$53) is now cleaned +++ 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) +++ 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 1 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.serialVersionUID 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared methods: 2 16/05/16 12:17:47 DEBUG ClosureCleaner: public final java.lang.Object org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(java.lang.Object) 16/05/16 12:17:47 DEBUG ClosureCleaner: public final org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1.apply(scala.collection.Iterator) 16/05/16 12:17:47 DEBUG ClosureCleaner: + inner classes: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer classes: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + outer objects: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + populating accessed fields because this is the starting closure 16/05/16 12:17:47 DEBUG ClosureCleaner: + fields accessed by starting closure: 0 16/05/16 12:17:47 DEBUG ClosureCleaner: + there are no enclosing objects! 
16/05/16 12:17:47 DEBUG ClosureCleaner: +++ closure (org.apache.spark.sql.execution.python.EvaluatePython$$anonfun$javaToPython$1) is now cleaned +++ 16/05/16 12:17:47 DEBUG ClosureCleaner: +++ Cleaning closure (org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13) +++ 16/05/16 12:17:47 DEBUG ClosureCleaner: + declared fields: 2 16/05/16 12:17:47 DEBUG ClosureCleaner: public static final long org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.serialVersionUID 16/05/16 12:17:47 DEBUG ClosureCleaner: private final org.apache.spark.rdd.RDD$$anonfun$collect$1 org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$13.$outer 16/05/16 12:17:47 DEBUG
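One common cause of seeing only the default database in 2.0 is that hive-site.xml is not visible to the driver, in which case Spark silently falls back to a local Derby metastore that contains only `default`. A small, hypothetical pre-flight check one could run before opening a context (the directory names are assumptions for illustration, not Spark's actual lookup logic):

```python
import os

def find_hive_site(conf_dirs):
    """Return the path of the first hive-site.xml found in conf_dirs, else None."""
    for d in conf_dirs:
        candidate = os.path.join(d, "hive-site.xml")
        if os.path.isfile(candidate):
            return candidate
    return None

# Hypothetical usage: check the dirs your deployment is supposed to expose.
# find_hive_site([os.environ.get("SPARK_CONF_DIR", "conf"), "/etc/hive/conf"])
```

If this returns None for the directories the 2.0 driver actually sees, the empty database list is expected rather than a metastore connectivity problem.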
[jira] [Commented] (SPARK-15343) NoClassDefFoundError when initializing Spark with YARN
[ https://issues.apache.org/jira/browse/SPARK-15343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284433#comment-15284433 ] Maciej Bryński commented on SPARK-15343: CC: [~vanzin] > NoClassDefFoundError when initializing Spark with YARN > -- > > Key: SPARK-15343 > URL: https://issues.apache.org/jira/browse/SPARK-15343 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Priority: Critical > > I'm trying to connect Spark 2.0 (compiled from branch-2.0) with Hadoop. > Spark compiled with: > {code} > ./dev/make-distribution.sh -Pyarn -Phadoop-2.6 -Phive -Phive-thriftserver > -Dhadoop.version=2.6.0 -DskipTests > {code} > I'm getting following error > {code} > mbrynski@jupyter:~/spark$ bin/pyspark > Python 3.4.0 (default, Apr 11 2014, 13:05:11) > [GCC 4.8.2] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Master yarn-client is deprecated since 2.0. Please use master "yarn" > with specified deploy mode instead. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). > 16/05/16 11:54:41 WARN SparkConf: The configuration key 'spark.yarn.jar' has > been deprecated as of Spark 2.0 and may be removed in the future. Please use > the new key 'spark.yarn.jars' instead. > 16/05/16 11:54:41 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 16/05/16 11:54:42 WARN AbstractHandler: No Server set for > org.spark_project.jetty.server.handler.ErrorHandler@f7989f6 > 16/05/16 11:54:43 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. 
> Traceback (most recent call last): > File "/home/mbrynski/spark/python/pyspark/shell.py", line 38, in > sc = SparkContext() > File "/home/mbrynski/spark/python/pyspark/context.py", line 115, in __init__ > conf, jsc, profiler_cls) > File "/home/mbrynski/spark/python/pyspark/context.py", line 172, in _do_init > self._jsc = jsc or self._initialize_context(self._conf._jconf) > File "/home/mbrynski/spark/python/pyspark/context.py", line 235, in > _initialize_context > return self._jvm.JavaSparkContext(jconf) > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 1183, in __call__ > File > "/home/mbrynski/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line > 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling > None.org.apache.spark.api.java.JavaSparkContext. > : java.lang.NoClassDefFoundError: > com/sun/jersey/api/client/config/ClientConfig > at > org.apache.hadoop.yarn.client.api.TimelineClient.createTimelineClient(TimelineClient.java:45) > at > org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceInit(YarnClientImpl.java:163) > at > org.apache.hadoop.service.AbstractService.init(AbstractService.java:163) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:150) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:148) > at org.apache.spark.SparkContext.(SparkContext.scala:502) > at > org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:422) > at 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:240) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:236) > at > py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80) > at > py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassNotFoundException: > com.sun.jersey.api.client.config.ClientConfig > at java.net.URLClassLoader.findClass(URLClassLoader.java:381) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 19 more > {code} > On 1.6 everything
[jira] [Commented] (SPARK-15247) sqlCtx.read.parquet yields at least n_executors * n_cores tasks
[ https://issues.apache.org/jira/browse/SPARK-15247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284432#comment-15284432 ] Sean Owen commented on SPARK-15247: --- Did you actually open a PR for this? > sqlCtx.read.parquet yields at least n_executors * n_cores tasks > --- > > Key: SPARK-15247 > URL: https://issues.apache.org/jira/browse/SPARK-15247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Johnny W. > > sqlCtx.read.parquet always yields at least n_executors * n_cores tasks, even > when reading only a single very small file. > This issue can increase the latency for small jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
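For context on the reported task count: one plausible explanation is that the read path sizes its partition count off sc.defaultParallelism, which on a cluster is roughly the total number of executor cores, putting a floor of n_executors * n_cores on the number of tasks. A toy model under that assumption (illustrative only, not Spark's actual code path):

```python
def expected_min_tasks(num_executors, cores_per_executor):
    """Toy model: defaultParallelism on a cluster is roughly the total
    executor cores (with a floor of 2), and is assumed here to act as a
    lower bound on the number of read tasks. This is an illustrative
    simplification, not Spark's actual splitting logic."""
    return max(num_executors * cores_per_executor, 2)
```

Under this model, a 4-executor, 8-core cluster would schedule at least 32 tasks even for a one-row parquet file, which matches the latency complaint in the report.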