[jira] [Created] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
Leandro Ferrado created SPARK-11758: --- Summary: Missing Index column while creating a DataFrame from Pandas Key: SPARK-11758 URL: https://issues.apache.org/jira/browse/SPARK-11758 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.1 Environment: Linux Debian, PySpark, in local testing. Reporter: Leandro Ferrado Priority: Minor

In PySpark's SQLContext, when createDataFrame() is invoked with a pandas.DataFrame and a 'schema' of StructFields, the helper _createFromLocal() converts the pandas.DataFrame but ignores two things:
- the index column, because it calls to_records() with the flag index=False
- timestamp records, because a date column can't be the index here and pandas doesn't convert its records to Timestamp type

So converting a DataFrame from pandas to Spark SQL handles scenarios with temporal records poorly. Doc: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html

Affected code:
{code}
def _createFromLocal(self, data, schema):
    """
    Create an RDD for DataFrame from a list or pandas.DataFrame, returns
    the RDD and schema.
    """
    if has_pandas and isinstance(data, pandas.DataFrame):
        if schema is None:
            schema = [str(x) for x in data.columns]
        data = [r.tolist() for r in data.to_records(index=False)]  # HERE
        # ...
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
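For illustration, the index loss can be reproduced without Spark at all. A hedged sketch (the frame contents here are made up) of what to_records(index=False) drops:

```python
import pandas as pd

# A frame whose only temporal information lives in the index, as described above.
pdf = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2015-11-01", "2015-11-02"]),
)

# What _createFromLocal() does today: index=False silently drops the index column.
without_index = [r.tolist() for r in pdf.to_records(index=False)]

# index=True would keep the timestamps as the first field of every record.
with_index = [r.tolist() for r in pdf.to_records(index=True)]

print(len(without_index[0]), len(with_index[0]))  # 1 vs 2 fields per row
```

So any fix presumably has to either pass index=True or reset the index into a regular column before conversion; which of the two Spark should choose is beyond this sketch.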
[jira] [Created] (SPARK-11759) Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found
Luis Alves created SPARK-11759: -- Summary: Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found Key: SPARK-11759 URL: https://issues.apache.org/jira/browse/SPARK-11759 Project: Spark Issue Type: Question Reporter: Luis Alves

I'm using Spark 1.5.1 and Mesos 0.25 in cluster mode. I have the spark dispatcher running, and I run spark-submit. The driver is launched, but it fails because the task it launches fails. In the logs of the launched task I can see the following error:

sh: 1: /opt/spark/bin/spark-class: not found

I checked my docker image and /opt/spark/bin/spark-class exists. I then noticed that it is being run with sh, so I tried to run the following in the docker image:

sh /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

It fails with the following error:

spark-class: 73: spark-class: Syntax error: "(" unexpected

Is this an error in Spark? Thanks
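For what it's worth, the '"(" unexpected' message is what a POSIX-only shell (dash, which is /bin/sh on Debian-based images) prints for bash array syntax, which spark-class relies on. A hedged sketch, using a stand-in script rather than spark-class itself, and assuming bash (and optionally dash) are on the PATH:

```python
import shutil
import subprocess
import tempfile
import textwrap

# A stand-in for the bash array syntax spark-class uses around line 73.
script = textwrap.dedent("""\
    CMD=("$@")
    echo "${CMD[0]}"
""")
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(script)
    path = f.name

# bash accepts the array syntax and prints the first argument...
bash_out = subprocess.run(["bash", path, "hello"],
                          capture_output=True, text=True)
print(bash_out.stdout.strip())

# ...while dash rejects the "(" with a syntax error, as in the report.
if shutil.which("dash"):
    dash_out = subprocess.run(["dash", path, "hello"],
                              capture_output=True, text=True)
    print(dash_out.stderr.strip())
```

If that is what is happening here, launching spark-class with bash (or using an image whose /bin/sh is bash) should avoid the error; that Mesos is invoking it via sh is my reading of the log line, not something I've verified.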
[jira] [Commented] (SPARK-11202) Unsupported dataType
[ https://issues.apache.org/jira/browse/SPARK-11202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006897#comment-15006897 ] F Jimenez commented on SPARK-11202: --- I have noticed the following commit that may have solved the problem https://github.com/apache/spark/commit/02149ff08eed3745086589a047adbce9a580389f > Unsupported dataType > > > Key: SPARK-11202 > URL: https://issues.apache.org/jira/browse/SPARK-11202 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: whc > > I read data from oracle and save as parquet ,then get the following error: > java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]} > ^ > at > org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(DataType.scala:245) > at > org.apache.spark.sql.types.DataType$.fromCaseClassString(DataType.scala:102) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromString(ParquetTypesConverter.scala:62) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:51) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > I checked the types, and there is no Timestamp or Date type in Oracle. > My Oracle table looks like this: > create table DW_DOMAIN > ( > domain_id NUMBER, > cityid NUMBER, > domain_type NUMBER, > domain_name VARCHAR2(80) > ) > and my code looks like this: > Map<String, String> options = new HashMap<String, String>(); > options.put("url", url); > options.put("driver", driver); > options.put("user", user); > options.put("password", password); > options.put("dbtable", "(select DOMAIN_NAME,DOMAIN_ID from > dw_domain ) t"); > DataFrame df = this.sqlContext.read().format("jdbc").options(options) > .load(); > df.write().mode(SaveMode.Append) > > .parquet("hdfs://cluster1:8020/database/count_domain/"); > If I add "to_char(DOMAIN_ID)", I get the correct result. 
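A hedged look at why that schema string is rejected. The precision/scale rule below reflects Spark's DecimalType constraints as I understand them, and the Oracle JDBC driver reportedly describes an unconstrained NUMBER column as precision 0, scale -127, which is how "decimal(0,-127)" ends up in the schema:

```python
import json
import re

# The schema string from the error above, with metadata trimmed for brevity.
schema = json.loads(
    '{"type":"struct","fields":['
    '{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{}},'
    '{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{}}]}'
)

valid = {}
for field in schema["fields"]:
    m = re.match(r"decimal\((-?\d+),(-?\d+)\)$", field["type"])
    if m:
        precision, scale = int(m.group(1)), int(m.group(2))
        # A usable decimal needs 1 <= precision (<= 38 in Spark) and
        # 0 <= scale <= precision; (0, -127) fails on every count.
        valid[field["name"]] = 1 <= precision <= 38 and 0 <= scale <= precision

print(valid)  # {'DOMAIN_ID': False}
```

Casting in the query, as the reporter notes with to_char(DOMAIN_ID), sidesteps the problem by shipping the column to Spark as a string.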
[jira] [Updated] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11752: --- Fix Version/s: (was: 1.5.2) 1.5.3 > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.5.3, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006793#comment-15006793 ] Jun Zheng commented on SPARK-11665: --- If no one else is interested, can you assign it to me? > Support other distance metrics for bisecting k-means > > > Key: SPARK-11665 > URL: https://issues.apache.org/jira/browse/SPARK-11665 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > Some people have requested support for other distance metrics, such as cosine > distance and Tanimoto distance, in bisecting k-means. > We should > - design the interfaces for distance metrics > - support the distances
[jira] [Resolved] (SPARK-11572) Exit AsynchronousListenerBus thread when stop() is called
[ https://issues.apache.org/jira/browse/SPARK-11572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved SPARK-11572. Resolution: Won't Fix > Exit AsynchronousListenerBus thread when stop() is called > - > > Key: SPARK-11572 > URL: https://issues.apache.org/jira/browse/SPARK-11572 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu > > As vonnagy reported in the following thread: > http://search-hadoop.com/m/q3RTtk982kvIow22 > Attempts to join the thread in AsynchronousListenerBus resulted in a lock-up > because the AsynchronousListenerBus thread was still receiving > SparkListenerExecutorMetricsUpdate messages from the DAGScheduler. > The proposed fix is to check the stopped flag within the loop of the > AsynchronousListenerBus thread.
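The proposed fix amounts to a consumer loop that re-checks a stopped flag on every iteration; here is a hedged, Spark-free Python sketch of that shape (class and field names are invented, not Spark's):

```python
import queue
import threading

class ListenerBus:
    """Toy event bus: one consumer thread draining a queue until stopped."""

    def __init__(self):
        self.events = queue.Queue()
        self.stopped = threading.Event()
        self.processed = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Re-checking the flag inside the loop is the point of the fix:
        # the thread exits promptly even if producers keep posting events.
        while not self.stopped.is_set():
            try:
                event = self.events.get(timeout=0.05)
            except queue.Empty:
                continue
            self.processed.append(event)

    def post(self, event):
        self.events.put(event)

    def stop(self):
        self.stopped.set()
        self.thread.join(timeout=2)

bus = ListenerBus()
bus.post("metrics-update")
bus.stop()
print(bus.thread.is_alive())  # False: the thread exited despite ongoing traffic
```

The real bus also has to decide what to do with events still queued at stop time; the sketch only shows why checking the flag inside the loop lets join() return.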
[jira] [Resolved] (SPARK-11522) input_file_name() returns "" for external tables
[ https://issues.apache.org/jira/browse/SPARK-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11522. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9542 [https://github.com/apache/spark/pull/9542] > input_file_name() returns "" for external tables > > > Key: SPARK-11522 > URL: https://issues.apache.org/jira/browse/SPARK-11522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Simeon Simeonov > Labels: external-tables, hive, sql > Fix For: 1.6.0 > > > Given an external table definition where the data consists of many CSV files, > {{input_file_name()}} returns empty strings. > Table definition: > {code} > CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = ",", >"quoteChar" = "\"", >"escapeChar"= "\\" > ) > LOCATION 'file:///Users/sim/spark/test/external_test' > {code} > Query: > {code} > sql("SELECT input_file_name() as file FROM external_test").show > {code} > Output: > {code} > ++ > |file| > ++ > || > || > ... > || > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11743) Add UserDefinedType support to RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11743. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9712 [https://github.com/apache/spark/pull/9712] > Add UserDefinedType support to RowEncoder > - > > Key: SPARK-11743 > URL: https://issues.apache.org/jira/browse/SPARK-11743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.6.0 > > > RowEncoder doesn't support UserDefinedType now. We should add the support for > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks
[ https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006824#comment-15006824 ] Kristina Plazonic commented on SPARK-10935: --- [~xusen] Thanks for pinging. Yes, I think I resolved it - I disabled Tungsten and that made the error go away (in a smaller case). However, I think I'm being super inefficient when generating the features for this problem, because of all the joins. Do you have any pointers on that? [~mengxr], I think it would really help data scientists to have a small document - a guide to feature assembly in Spark - what to do and what not to do when using joins, especially when using ML, i.e. DataFrames. I spent an inordinate amount of time on that, and I'm still confused! For example, should I use DataFrames at all when doing joins? Is it better to use RDDs, because you can partition RDDs by keys but not DataFrames (e.g. in this example every join is by UserID, and you have 4 million users, so if you had partitioned the dataframes by UserID, every join would be local)? Another example: when I started seeing the memory errors with joins, I started asking myself whether a whole DataFrame passed into a function is included in the function's closure, with a copy shipped off with every task, or whether Spark takes into account that the argument is a distributed object and passes only a reference to each partition. I still don't really know for sure. All examples on the Spark website and docs and even books are for scripts, not functions with RDD or DataFrame arguments. Thanks for any insights... 
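On the partitioning question: the intuition can be sketched without Spark (names and data below are invented). If both sides of a join are hash-partitioned by the join key into the same number of partitions, every matching pair of rows lands in the same partition, so the join needs no cross-partition traffic:

```python
def hash_partition(rows, key, n):
    """Assign each row to one of n partitions by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

users  = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
clicks = [{"id": 1, "ad": "x"}, {"id": 2, "ad": "y"}, {"id": 1, "ad": "z"}]

n = 4
joined = []
# Both sides partitioned the same way: matching ids share a partition index,
# so each partition can be joined locally.
for up, cp in zip(hash_partition(users, "id", n), hash_partition(clicks, "id", n)):
    for u in up:
        for c in cp:
            if u["id"] == c["id"]:
                joined.append((u["name"], c["ad"]))

print(sorted(joined))
```

In the RDD API this is roughly what partitionBy on a pair RDD buys you; DataFrames in 1.5 don't expose an equivalent knob, which is presumably why the DataFrame joins here shuffle every time.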
> Avito Context Ad Clicks > --- > > Key: SPARK-10935 > URL: https://issues.apache.org/jira/browse/SPARK-10935 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > From [~kpl...@gmail.com]: > I would love to do Avito Context Ad Clicks - > https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of > feature engineering and preprocessing. I would love to split this with > somebody else if anybody is interested on working with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11752. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9728 [https://github.com/apache/spark/pull/9728] > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0, 1.5.2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006828#comment-15006828 ] Pedro Vilaça commented on SPARK-8332: - We're facing the same problem with a Spark Streaming job, and I noticed that this issue was closed. Is there a plan to upgrade the Jackson version that is being used? > NoSuchMethodError: > com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer > -- > > Key: SPARK-8332 > URL: https://issues.apache.org/jira/browse/SPARK-8332 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0 >Reporter: Tao Li >Priority: Critical > Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson > > I compiled the new Spark 1.4.0 version. > But when I run a simple WordCount demo, it throws a NoSuchMethodError: > {code} > java.lang.NoSuchMethodError: > com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer > {code} > I found out that the default "fasterxml.jackson.version" is 2.4.4. > Is there anything wrong, or a conflict with the Jackson version? > Or does some project Maven dependency possibly contain the wrong > version of Jackson?
[jira] [Updated] (SPARK-11522) input_file_name() returns "" for external tables
[ https://issues.apache.org/jira/browse/SPARK-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11522: - Assignee: Xin Wu > input_file_name() returns "" for external tables > > > Key: SPARK-11522 > URL: https://issues.apache.org/jira/browse/SPARK-11522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Simeon Simeonov >Assignee: Xin Wu > Labels: external-tables, hive, sql > Fix For: 1.6.0 > > > Given an external table definition where the data consists of many CSV files, > {{input_file_name()}} returns empty strings. > Table definition: > {code} > CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = ",", >"quoteChar" = "\"", >"escapeChar"= "\\" > ) > LOCATION 'file:///Users/sim/spark/test/external_test' > {code} > Query: > {code} > sql("SELECT input_file_name() as file FROM external_test").show > {code} > Output: > {code} > ++ > |file| > ++ > || > || > ... > || > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006895#comment-15006895 ] Kostas papageorgopoulos commented on SPARK-11700: - One workaround that minimizes the effect is to keep the JavaSparkContext alive forever (never stop it inside a long-running JVM process) and to set the following options to very small numbers, so that the relevant {{JobProgressListener}} maps get cleaned:
{code}
spark.ui.retainedJobs 1000                 How many jobs the Spark UI and status APIs remember before garbage collecting.
spark.ui.retainedStages 1000               How many stages the Spark UI and status APIs remember before garbage collecting.
spark.worker.ui.retainedExecutors 1000     How many finished executors the Spark UI and status APIs remember before garbage collecting.
spark.worker.ui.retainedDrivers 1000       How many finished drivers the Spark UI and status APIs remember before garbage collecting.
spark.sql.ui.retainedExecutions 1000       How many finished executions the Spark UI and status APIs remember before garbage collecting.
spark.streaming.ui.retainedBatches 1000    How many finished batches the Spark UI and status APIs remember before garbage collecting.
{code}
> Memory leak at SparkContext jobProgressListener stageIdToData map > - > > Key: SPARK-11700 > URL: https://issues.apache.org/jira/browse/SPARK-11700 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51 Apache tomcat > 8.0.28. Spring 4 >Reporter: Kostas papageorgopoulos >Priority: Minor > Labels: leak, memory-leak > Attachments: AbstractSparkJobRunner.java, > SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, > SparkMemoryAfterLotsOfConsecutiveRuns.png, > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png > > > It seems that there is a SparkContext jobProgressListener memory leak. > Below I describe the steps I take to reproduce it. 
> I have created a Java webapp that abstractly runs some Spark SQL jobs that
> read data from HDFS (joining them) and write them to ElasticSearch using the
> ES-Hadoop connector. After a lot of consecutive runs I noticed that my heap
> space was full, so I got an out-of-heap-space error.
> In the attached file {{AbstractSparkJobRunner}}, the method {{public final
> void run(T jobConfiguration, ExecutionLog executionLog) throws Exception}}
> runs each time a Spark SQL job is triggered, so I tried to reuse the same
> SparkContext for a number of consecutive runs. If certain rules apply, I try
> to clean up the SparkContext by first calling {{killSparkAndSqlContext}}.
> This code eventually runs:
> {code}
> synchronized (sparkContextThreadLock) {
>     if (javaSparkContext != null) {
>         LOGGER.info("!!! CLEARING SPARK CONTEXT!!!");
>         javaSparkContext.stop();
>         javaSparkContext = null;
>         sqlContext = null;
>         System.gc();
>     }
>     numberOfRunningJobsForSparkContext.getAndSet(0);
> }
> {code}
> So at some point in time, if no other Spark SQL job should run, I kill the
> SparkContext ({{AbstractSparkJobRunner.killSparkAndSqlContext}} runs) and it
> should be garbage collected. However, this is not the case: even though my
> debugger shows that my JavaSparkContext object is null (see the attached
> picture SparkContextPossibleMemoryLeakIDEA_DEBUG.png), jvisualvm shows heap
> usage growing even when the garbage collector is called (see the attached
> picture SparkHeapSpaceProgress.png).
> The Memory Analyzer Tool shows that a big part of the retained heap is
> assigned to _jobProgressListener (see the attached picture
> SparkMemoryAfterLotsOfConsecutiveRuns.png and the summary picture
> SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png), although at the
> same time the JavaSparkContext in the singleton service is null. 
[jira] [Created] (SPARK-11760) SQL Catalyst data time test fails
Jean-Baptiste Onofré created SPARK-11760: Summary: SQL Catalyst data time test fails Key: SPARK-11760 URL: https://issues.apache.org/jira/browse/SPARK-11760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Jean-Baptiste Onofré

In the sql/catalyst module, test("hours / minute / seconds") fails on the third test datum:
{code}
- hours / miniute / seconds *** FAILED ***
29 did not equal 50 (DateTimeUtilsSuite.scala:370)
{code}
The problem is that it doesn't use the timezone for the seconds, so we may end up comparing two different timestamps. I will submit a PR to fix that in DateTimeUtils.
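The 29-vs-50 failure is consistent with a seconds-level zone offset being applied on one side of the comparison but not the other. A hedged Python sketch (the 21-second offset below is invented to make the arithmetic visible, not taken from the actual failing zone; some historical zones do have non-whole-minute offsets):

```python
from datetime import datetime, timezone, timedelta

# An absolute instant whose seconds field is 29 in UTC.
ts = datetime(2015, 11, 16, 12, 0, 29, tzinfo=timezone.utc)
utc_seconds = ts.second

# Converting through a zone whose offset is not a whole number of minutes
# shifts the seconds field itself: 29 + 21 = 50.
odd_zone = timezone(timedelta(hours=5, minutes=30, seconds=21))
local_seconds = ts.astimezone(odd_zone).second

print(utc_seconds, local_seconds)  # 29 50
```

So a getSeconds that applies the zone on only one side of the comparison can report "29 did not equal 50" even though both values name the same instant.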
[jira] [Resolved] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11044. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9060 [https://github.com/apache/spark/pull/9060] > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.7.0 > > > Spark only writes the parquet files with writer version1, ignoring the given > configuration. > It should let users choose the writer version (keeping version1 as the > default).
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006672#comment-15006672 ] Apache Spark commented on SPARK-11191: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9737 > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.5.3, 1.6.0 > > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thrifeserver}} > If i run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar"}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Updated] (SPARK-11757) Incorrect join output for joining two dataframes loaded from Parquet format
[ https://issues.apache.org/jira/browse/SPARK-11757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petri Kärkäs updated SPARK-11757: - Description: Reading in dataframes from Parquet format in s3, and executing a join between them fails when evoked by column name. Works correctly if a join condition is used instead: {code:none} sqlContext = SQLContext(sc) a = sqlContext.read.parquet('s3://path-to-data-a/') b = sqlContext.read.parquet('s3://path-to-data-b/') # result 0 rows c = a.join(b, on='id', how='left_outer') c.count() # correct output d = a.join(b, a['id']==b['id'], how='left_outer') d.count() {code} was: Reading in dataframes from Parquet format in s3, and executing a join between them fails when evoked by column name. Works correctly if a join condition is used instead: sqlContext = SQLContext(sc) a = sqlContext.read.parquet('s3://path-to-data-a/') b = sqlContext.read.parquet('s3://path-to-data-b/') # results 0 rows c = a.join(b, on='id', how='left_outer') c.count() # correct result d = a.join(b, a['id']==b['id'], how='left_outer') d.count() > Incorrect join output for joining two dataframes loaded from Parquet format > --- > > Key: SPARK-11757 > URL: https://issues.apache.org/jira/browse/SPARK-11757 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 > Environment: Python 2.7, Spark 1.5.0, Amazon linux ami > https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/ >Reporter: Petri Kärkäs > Labels: dataframe, emr, join, pyspark > > Reading in dataframes from Parquet format in s3, and executing a join between > them fails when evoked by column name. 
Works correctly if a join condition is > used instead: > {code:none} > sqlContext = SQLContext(sc) > a = sqlContext.read.parquet('s3://path-to-data-a/') > b = sqlContext.read.parquet('s3://path-to-data-b/') > # result 0 rows > c = a.join(b, on='id', how='left_outer') > c.count() > # correct output > d = a.join(b, a['id']==b['id'], how='left_outer') > d.count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006654#comment-15006654 ] Apache Spark commented on SPARK-11530: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/9736 > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
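On the substance of the request: once the eigenvalues are exposed, the proportion of variance explained is a one-liner. A library-free sketch on a made-up 2x2 covariance matrix, where the eigenvalues have a closed form so no linear-algebra package is needed:

```python
import math

# Made-up symmetric covariance matrix cov = [[a, b], [b, d]].
a, b, d = 100.0, 12.0, 9.0

# Closed-form eigenvalues of a 2x2 symmetric matrix from trace/determinant.
tr, det = a + d, a * d - b * b
lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)   # larger eigenvalue
lam2 = tr / 2 - math.sqrt(tr * tr / 4 - det)   # smaller eigenvalue

# Proportion of variance explained by the top principal component.
explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # 0.932
```

For k components of a larger matrix the same ratio is sum of the top-k eigenvalues over the eigenvalue sum, which is exactly why the issue asks for the eigenvalues to be returned alongside the principal components.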
[jira] [Assigned] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11530: Assignee: Apache Spark > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis >Assignee: Apache Spark > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11530: Assignee: (was: Apache Spark) > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11757) Incorrect join output for joining two dataframes loaded from Parquet format
Petri Kärkäs created SPARK-11757: Summary: Incorrect join output for joining two dataframes loaded from Parquet format Key: SPARK-11757 URL: https://issues.apache.org/jira/browse/SPARK-11757 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: Python 2.7, Spark 1.5.0, Amazon linux ami https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/ Reporter: Petri Kärkäs Reading in dataframes from Parquet format in S3 and executing a join between them fails when the join is invoked by column name. Works correctly if a join condition is used instead:
{code}
sqlContext = SQLContext(sc)
a = sqlContext.read.parquet('s3://path-to-data-a/')
b = sqlContext.read.parquet('s3://path-to-data-b/')
# results in 0 rows
c = a.join(b, on='id', how='left_outer')
c.count()
# correct result
d = a.join(b, a['id'] == b['id'], how='left_outer')
d.count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
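For reference, both invocations above express the same left outer join, so their counts should agree; in particular a left outer join can never return fewer rows than its left input. A minimal pure-Python model of the expected semantics, with toy rows standing in for the Parquet data (not Spark code):

```python
# Toy stand-ins for the two Parquet inputs (hypothetical data).
a = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
b = [{"id": 1, "y": "b1"}]

def left_outer_join(left, right, key):
    """Reference semantics: every left row survives, matched or not."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append(dict(l))  # unmatched left row; right-side columns absent/null
    return out

rows = left_outer_join(a, b, "id")
print(len(rows))  # 2 -- never fewer rows than the left input
```

Against this baseline, a count of 0 from `a.join(b, on='id', how='left_outer')` is clearly wrong regardless of the data.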
[jira] [Resolved] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11692. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9658 [https://github.com/apache/spark/pull/9658] > Support for Parquet logical types, JSON and BSON (embedded types) > -- > > Key: SPARK-11692 > URL: https://issues.apache.org/jira/browse/SPARK-11692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 1.7.0 > > > Add support for the Parquet logical types JSON and BSON: JSON is represented > as UTF-8, and BSON as binary. > {code} > org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (BSON); > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:118) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:177) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:100) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:82) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Assignee: Hyukjin Kwon > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.7.0 > > > Spark only writes Parquet files with writer version1, ignoring the given > configuration. > It should let users choose the writer version (keeping version1 as the > default). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11692: --- Assignee: Hyukjin Kwon > Support for Parquet logical types, JSON and BSON (embedded types) > -- > > Key: SPARK-11692 > URL: https://issues.apache.org/jira/browse/SPARK-11692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > > Add support for the Parquet logical types JSON and BSON: JSON is represented > as UTF-8, and BSON as binary. > {code} > org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (BSON); > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:118) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:177) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:100) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:82) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10181) HiveContext is not used with keytab principal but with user principal/unix username
[ https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007071#comment-15007071 ] Yin Huai commented on SPARK-10181: -- [~bolke] I have merged it into branch-1.5. It will be released with 1.5.3. > HiveContext is not used with keytab principal but with user principal/unix > username > --- > > Key: SPARK-10181 > URL: https://issues.apache.org/jira/browse/SPARK-10181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: kerberos >Reporter: Bolke de Bruin >Assignee: Yu Gao > Labels: hive, hivecontext, kerberos > Fix For: 1.5.3, 1.6.0 > > > `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G > --driver-java-options -XX:MaxPermSize=4G --driver-class-path > lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml > --files conf/hive-site.xml --master yarn --principal sparkjob --keytab > /etc/security/keytabs/sparkjob.keytab --conf > spark.yarn.executor.memoryOverhead=18000 --conf > "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf > spark.eventLog.enabled=false ~/test.py` > With: > #!/usr/bin/python > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext() > sqlContext = HiveContext(sc) > query = """ SELECT * FROM fm.sk_cluster """ > rdd = sqlContext.sql(query) > rdd.registerTempTable("test") > sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * > FROM test") > Ends up with: > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denied: user=ua80tl, access=READ_EXECUTE, > inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_7805739002405787834-1/-ext-1":sparkjob:hdfs:drwxr-x--- > (Our umask denies read access to others by default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11760: Assignee: (was: Apache Spark) > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11760: Assignee: Apache Spark > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré >Assignee: Apache Spark > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006932#comment-15006932 ] Apache Spark commented on SPARK-11760: -- User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9738 > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11089: Assignee: Cheng Lian (was: Apache Spark) > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Cheng Lian > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11089: Assignee: Apache Spark (was: Cheng Lian) > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007010#comment-15007010 ] Apache Spark commented on SPARK-11089: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9740 > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Cheng Lian > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007038#comment-15007038 ] Maciej Szymkiewicz commented on SPARK-11281: [~shivaram] I've tested both current master and my PR for [SPARK-11086] and it looks like it is indeed resolved. I would like to add some tests, but otherwise it looks like it can be closed. > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Baptiste Onofré resolved SPARK-11760. -- Resolution: Invalid It has already been fixed by: {code} commit 06f1fdba6d1425afddfc1d45a20dbe9bede15e7a Author: Wenchen Fan Date: Mon Nov 16 08:58:40 2015 -0800 [SPARK-11752] [SQL] fix timezone problem for DateTimeUtils.getSeconds code snippet to reproduce it: ``` TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai")) val t = Timestamp.valueOf("1900-06-11 12:14:50.789") val us = fromJavaTimestamp(t) assert(getSeconds(us) === t.getSeconds) ``` it will be good to add a regression test for it, but the reproducing code need to change the default timezone, and even we change it back, the `lazy val defaultTimeZone` in `DataTimeUtils` is fixed. Author: Wenchen Fan Closes #9728 from cloud-fan/seconds. {code} > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
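The timezone subtlety behind the fix quoted above: a local seconds-of-minute field depends on the UTC offset, and historical offsets are not always whole minutes (Asia/Shanghai's pre-1901 local mean time was roughly +8:05:43; the exact value here is assumed for illustration). Ignoring the offset therefore shifts the seconds field, which is why `getSeconds` disagreed with `Timestamp.getSeconds`. A pure-Python sketch:

```python
# Assumed pre-1901 Shanghai offset: +8h 5m 43s (not a whole number of minutes).
OFFSET_SECONDS = 8 * 3600 + 5 * 60 + 43

def seconds_field(utc_seconds_of_day, offset=OFFSET_SECONDS):
    """Seconds-of-minute of the local wall-clock time for a UTC second count."""
    return (utc_seconds_of_day + offset) % 60

# A local wall-clock reading of 12:14:50 corresponds to this UTC value:
utc = (12 * 3600 + 14 * 60 + 50) - OFFSET_SECONDS

print(seconds_field(utc))  # 50 -- offset applied, matches the wall clock
print(utc % 60)            # 7  -- offset ignored, the buggy behaviour
```

With a whole-minute offset the two computations coincide, which is why the bug only surfaces for old timestamps in zones with odd historical offsets.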
[jira] [Updated] (SPARK-11759) Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found
[ https://issues.apache.org/jira/browse/SPARK-11759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11759: -- Component/s: Mesos Deploy > Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: > not found > --- > > Key: SPARK-11759 > URL: https://issues.apache.org/jira/browse/SPARK-11759 > Project: Spark > Issue Type: Question > Components: Deploy, Mesos >Reporter: Luis Alves > > I'm using Spark 1.5.1 and Mesos 0.25 in cluster mode. I have the > spark-dispatcher running, and run spark-submit. The driver is launched, but > it fails because it seems that the task it launches fails. > In the logs of the launched task I can see the following error: > sh: 1: /opt/spark/bin/spark-class: not found > I checked my docker image and /opt/spark/bin/spark-class exists. I then > noticed that it's using sh, so I tried to run (in the docker image) > the following: > sh /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master > It fails with the following error: > spark-class: 73: spark-class: Syntax error: "(" unexpected > Is this an error in Spark? > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
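A likely explanation of the second error (an assumption, not confirmed in the thread): on Debian, `/bin/sh` is dash, a strict POSIX shell, while `spark-class` uses bash-only constructs such as arrays, so running it via `sh` fails with `Syntax error: "(" unexpected`. The difference can be demonstrated from Python (assumes `bash` is installed; the snippet is illustrative, not taken from spark-class):

```python
import subprocess

# Bash-only array syntax, similar in spirit to what spark-class uses.
snippet = 'CMD=(echo hello from an array); "${CMD[@]}"'

# bash accepts the array syntax and runs the command it builds.
bash = subprocess.run(["bash", "-c", snippet], capture_output=True, text=True)
print(bash.stdout.strip())  # hello from an array

# Under a strict POSIX sh such as dash, the same snippet is rejected at parse
# time with: Syntax error: "(" unexpected
```

If that is the cause, invoking the script with bash explicitly (or making bash the image's `sh`) would sidestep the error.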
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006966#comment-15006966 ] Shivaram Venkataraman commented on SPARK-11281: --- Does the example posted in the description work now, or does it still not work? Sorry, I'm just confused about what the resolution to this bug was (i.e. whether it was fixed or we decided we won't fix it, etc.) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006955#comment-15006955 ] Shivaram Venkataraman commented on SPARK-11281: --- [~sunrui] [~zero323] Is there a test case in https://github.com/apache/spark/commit/d7d9fa0b8750166f8b74f9bc321df26908683a8b that covers this? > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11281: -- Assignee: Maciej Szymkiewicz > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006960#comment-15006960 ] Maciej Szymkiewicz commented on SPARK-11281: [~shivaram] No, there isn't. I removed this one because there was nothing we could test there. > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reopened SPARK-11281: --- Assignee: (was: Maciej Szymkiewicz) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006979#comment-15006979 ] Apache Spark commented on SPARK-11716: -- User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9739 > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
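The fix the reporter proposes — passing the input type information through to the returned UserDefinedFunction instead of dropping it — can be sketched with a hypothetical pure-Python registry (the names and signatures below are illustrative; Spark's actual API is Scala):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserDefinedFunction:
    """Hypothetical stand-in for Spark's UserDefinedFunction."""
    func: Callable
    data_type: type
    input_types: List[type]  # the information the report says is dropped


class UDFRegistration:
    """Sketch of a registry whose register() keeps input types."""

    def __init__(self):
        self._udfs = {}

    def register(self, name, func, data_type, input_types):
        # Proposed change: construct the UDF with input_types instead of
        # discarding them, i.e. UserDefinedFunction(func, dataType, inputType)
        udf = UserDefinedFunction(func, data_type, input_types)
        self._udfs[name] = udf
        return udf


reg = UDFRegistration()
plus = reg.register("plus", lambda a, b: a + b, int, [int, int])
```

With this one-line change per register overload, callers can inspect `plus.input_types` after registration rather than getting an empty list.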
[jira] [Comment Edited] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006998#comment-15006998 ] Marcelo Vanzin edited comment on SPARK-11617 at 11/16/15 5:53 PM: -- Can you post the exceptions if they're different than the ones you posted before? was (Author: vanzin): Can you post the exception if they're different than the ones you posted before? > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10181) HiveContext is not used with keytab principal but with user principal/unix username
[ https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10181: - Fix Version/s: 1.5.3 > HiveContext is not used with keytab principal but with user principal/unix > username > --- > > Key: SPARK-10181 > URL: https://issues.apache.org/jira/browse/SPARK-10181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: kerberos >Reporter: Bolke de Bruin >Assignee: Yu Gao > Labels: hive, hivecontext, kerberos > Fix For: 1.5.3, 1.6.0 > > > `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G > --driver-java-options -XX:MaxPermSize=4G --driver-class-path > lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml > --files conf/hive-site.xml --master yarn --principal sparkjob --keytab > /etc/security/keytabs/sparkjob.keytab --conf > spark.yarn.executor.memoryOverhead=18000 --conf > "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf > spark.eventLog.enabled=false ~/test.py` > With: > #!/usr/bin/python > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext() > sqlContext = HiveContext(sc) > query = """ SELECT * FROM fm.sk_cluster """ > rdd = sqlContext.sql(query) > rdd.registerTempTable("test") > sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * > FROM test") > Ends up with: > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denie > d: user=ua80tl, access=READ_EXECUTE, > inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_78057390024057878 > 34-1/-ext-1":sparkjob:hdfs:drwxr-x--- > (Our umask denies read access to other by default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006975#comment-15006975 ] Alex Nastetsky commented on SPARK-11512: There are 3 situations: 1) dataset A and dataset B are both partitioned/sorted the same on disk and need to be joined. The join should be able to take advantage of their partitioning/sort. 2) dataset A is partitioned/sorted on disk, dataset B gets generated during the app run and needs to be joined to dataset A. The join should be able to take advantage of dataset A's partitioning/sort and mimic the same partitioning/sort on dataset B, without having to pre-process dataset A. Perhaps something like repartitionAndSortWithinPartitions could be performed on dataset B? 3) dataset A and B are both generated during the app run and need to be joined. I believe doing a Sort Merge Join on these is already supported in SPARK-2213. The first 2 situations are what this ticket is for. > Bucket Join > --- > > Key: SPARK-11512 > URL: https://issues.apache.org/jira/browse/SPARK-11512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Sort merge join on two datasets on the file system that have already been > partitioned the same with the same number of partitions and sorted within > each partition, so we don't need to sort them again while joining on the > sorted/partitioned keys > This functionality exists in > - Hive (hive.optimize.bucketmapjoin.sortedmerge) > - Pig (USING 'merge') > - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
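Situation 1 above — joining two datasets that are already partitioned and sorted the same way — reduces to a streaming merge with no shuffle or re-sort. A minimal pure-Python sketch of the idea (illustrative only; Spark's sort-merge join from SPARK-2213 is far more involved):

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs, both assumed
    pre-sorted by key. Advances two cursors in lockstep, so neither
    side is re-sorted or shuffled."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the left row against every right row sharing the key
            jj = j
            while jj < len(right) and right[jj][0] == lk:
                out.append((lk, lv, right[jj][1]))
                jj += 1
            i += 1
    return out


joined = sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")])
```

Skipping the sort on both sides is exactly the saving this ticket asks for when the on-disk bucketing already guarantees the ordering.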
[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes
[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006978#comment-15006978 ] shane knapp commented on SPARK-11655: - just wanted to say that things are definitely looking a LOT better! i'll keep an eye on things this week, but we're definitely out of the woods. thanks [~joshrosen] and [~vanzin]! > SparkLauncherBackendSuite leaks child processes > --- > > Key: SPARK-11655 > URL: https://issues.apache.org/jira/browse/SPARK-11655 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.6.0 > > Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png > > > We've been combatting an orphaned process issue on AMPLab Jenkins since > October and I finally was able to dig in and figure out what's going on. > After some sleuthing and working around OS limits and JDK bugs, I was able to > get the full launch commands for the hanging orphaned processes. 
It looks > like they're all running spark-submit: > {code} > org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf > spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/ > -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m > {code} > Based on the output of some Ganglia graphs, I was able to figure out that > these leaks started around October 9. > !screenshot-1.png|thumbnail! > This roughly lines up with when https://github.com/apache/spark/pull/7052 was > merged, which added LauncherBackendSuite. The launch arguments used in this > suite seem to line up with the arguments that I observe in the hanging > processes' {{jps}} output: > https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46 > Interestingly, Jenkins doesn't show test timing or output for this suite! 
I > think that what might be happening is that we have a mixed Scala/Java > package, so maybe the two test runner XML files aren't being merged properly: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/ > Whenever I try running this suite locally, it looks like it ends up creating > a zombie SparkSubmit process! I think that what's happening is that the > launcher's {{handle.kill()}} call ends up destroying the bash > {{spark-submit}} subprocess such that its child process (a JVM) leaks. > I think that we'll have to do something similar to what we do in PySpark when > launching a child JVM from a Python / Bash process: connect it to a socket or > stream such that it can detect its parent's death and clean up after itself > appropriately. > /cc [~shaneknapp] and [~vanzin]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
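The PySpark-style fix described above — letting the child JVM detect its parent's death through a connected stream and clean up after itself — can be sketched as a standalone demo (hypothetical code, not Spark's actual launcher implementation):

```python
import subprocess
import sys
import textwrap

# The child blocks reading stdin; read() returns EOF once the parent's end
# of the pipe closes (because the parent exited or cleaned up), at which
# point the child exits instead of lingering as an orphaned process.
CHILD_SCRIPT = textwrap.dedent("""
    import sys
    sys.stdin.read()   # blocks until the parent closes the pipe
    sys.exit(0)        # clean shutdown on parent death
""")


def launch_monitored_child():
    """Start a child whose lifetime is tied to the parent via a pipe."""
    return subprocess.Popen([sys.executable, "-c", CHILD_SCRIPT],
                            stdin=subprocess.PIPE)


proc = launch_monitored_child()
proc.stdin.close()      # simulate the parent going away
proc.wait(timeout=10)   # the child notices EOF and exits promptly
```

The same pattern works whether the channel is a pipe or a socket; the key point is that the child owns one end of a connection that the OS closes for it when the parent dies.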
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007064#comment-15007064 ] Shivaram Venkataraman commented on SPARK-11281: --- That's cool! Let's keep this open till we add tests and then close it as part of that PR > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls
[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007131#comment-15007131 ] Joseph K. Bradley commented on SPARK-11569: --- To choose the right API, my first comments are: * What do other libraries do when given null/bad values? (scikit-learn and R are the ones I tend to look at.) * I'd prefer to make the behavior adjustable using an option with a default. The default I'd vote for is throwing a nice error upon seeing null, though I could be convinced to go for another. * When we do index null, we should ideally maintain current indexing behavior, so it may make the most sense to put null at the end. > StringIndexer transform fails when column contains nulls > > > Key: SPARK-11569 > URL: https://issues.apache.org/jira/browse/SPARK-11569 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz > > Transforming column containing {{null}} values using {{StringIndexer}} > results in {{java.lang.NullPointerException}} > {code} > from pyspark.ml.feature import StringIndexer > df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v")) > df.printSchema() > ## root > ## |-- k: string (nullable = true) > ## |-- v: long (nullable = true) > indexer = StringIndexer(inputCol="k", outputCol="kIdx") > indexer.fit(df).transform(df) > ##py4j.protocol.Py4JJavaError: An error occurred while calling o75.json. 
> ## : java.lang.NullPointerException > {code} > Problem disappears when we drop > {code} > df1 = df.na.drop() > indexer.fit(df1).transform(df1) > {code} > or replace {{nulls}} > {code} > from pyspark.sql.functions import col, when > k = col("k") > df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k)) > indexer.fit(df2).transform(df2) > {code} > and cannot be reproduced using Scala API > {code} > import org.apache.spark.ml.feature.StringIndexer > val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v") > df.printSchema > // root > // |-- k: string (nullable = true) > // |-- v: integer (nullable = false) > val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx") > indexer.fit(df).transform(df).count > // 2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
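One way to realize the option discussed above — keep the current frequency-ordered label indexing and, when nulls are indexed at all, assign null the last index — is sketched below in plain Python (illustrative only, not Spark ML's StringIndexer implementation):

```python
from collections import Counter


def fit_string_indexer(values, index_null=True):
    """Build a label -> index map ordered by descending frequency
    (ties broken alphabetically, mirroring StringIndexer's ordering).
    If index_null is True and nulls are present, null gets the last
    index so existing label indices are unchanged."""
    freq = Counter(v for v in values if v is not None)
    labels = [lab for lab, _ in sorted(freq.items(),
                                       key=lambda kv: (-kv[1], kv[0]))]
    index = {lab: i for i, lab in enumerate(labels)}
    if index_null and any(v is None for v in values):
        index[None] = len(labels)  # null appended after all real labels
    return index


idx = fit_string_indexer(["a", "b", "a", None])
```

With `index_null=False` the transform could instead raise a descriptive error on the first null, which matches the proposed default of failing loudly rather than with a bare NullPointerException.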
[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Fix Version/s: 1.6.0 > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0, 1.7.0 > > > Spark only writes the parquet files with writer version1 ignoring given > configuration. > It should let users choose the writer version. (remaining the default as > version1). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11743) Add UserDefinedType support to RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11743: -- Assignee: Liang-Chi Hsieh > Add UserDefinedType support to RowEncoder > - > > Key: SPARK-11743 > URL: https://issues.apache.org/jira/browse/SPARK-11743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > RowEncoder doesn't support UserDefinedType now. We should add the support for > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11752: -- Assignee: Wenchen Fan > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.3, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11716: Assignee: (was: Apache Spark) > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11716: Assignee: Apache Spark > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Assignee: Apache Spark >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-11281: --- Comment: was deleted (was: [~sunrui], [~shivaram] I don't think it is resolved by [SPARK-11086]. ) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to to access Map field created from an environment. > Assuming local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > Problem seems to be specific to environments and cannot be reproduced when > Map comes for example from Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006998#comment-15006998 ] Marcelo Vanzin commented on SPARK-11617: Can you post the exception if they're different than the ones you posted before? > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11713) Initial RDD for updateStateByKey for pyspark
[ https://issues.apache.org/jira/browse/SPARK-11713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007004#comment-15007004 ] Bryan Cutler commented on SPARK-11713: -- I could work on this > Initial RDD for updateStateByKey for pyspark > > > Key: SPARK-11713 > URL: https://issues.apache.org/jira/browse/SPARK-11713 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: David Watson > > It would be infinitely useful to add initial rdd to the pyspark DStream > interface to match the scala and java interfaces > (https://issues.apache.org/jira/browse/SPARK-3660). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11553: - Target Version/s: 1.6.0 Priority: Blocker (was: Minor) > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Blocker > > row.getInt|getFloat|getDouble on a Spark Row return 0 if row[index] is null, even > though, according to the documentation, they should throw a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
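The documented contract — failing loudly on null instead of silently coercing it to a primitive default — can be sketched with a hypothetical row wrapper (not Spark's actual Row class):

```python
class Row:
    """Sketch of a row accessor that refuses to coerce null to 0."""

    def __init__(self, values):
        self._values = list(values)

    def is_null_at(self, i):
        """Callers can check for null explicitly before a typed getter."""
        return self._values[i] is None

    def get_int(self, i):
        v = self._values[i]
        if v is None:
            # Match the documented behavior: raise rather than return 0,
            # so a genuine 0 is distinguishable from a missing value
            raise ValueError(f"value at index {i} is null; "
                             f"check is_null_at({i}) first")
        return int(v)


row = Row([42, None])
```

Returning 0 for null is dangerous precisely because the caller cannot tell a real zero apart from a missing value; the explicit `is_null_at` check makes the intent visible at the call site.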
[jira] [Resolved] (SPARK-11718) Explicit killing executor dies silent without get response information
[ https://issues.apache.org/jira/browse/SPARK-11718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11718. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 1.6.0 > Explicit killing executor dies silent without get response information > -- > > Key: SPARK-11718 > URL: https://issues.apache.org/jira/browse/SPARK-11718 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.6.0 > > > Because of change of AM and scheduler executor failure detection mechanism, > explicit killing executor can not response back to driver, this will make > dynamic allocation wrongly maintain the executor metadata. > I'm working on this... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11684) Update user guide to show new features in SparkR:::glm and SparkR:::summary
[ https://issues.apache.org/jira/browse/SPARK-11684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11684: -- Shepherd: Xiangrui Meng > Update user guide to show new features in SparkR:::glm and SparkR:::summary > --- > > Key: SPARK-11684 > URL: https://issues.apache.org/jira/browse/SPARK-11684 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > * feature interaction in R formula > * model coefficients in logistic regression > * model summary in linear regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007482#comment-15007482 ] Apache Spark commented on SPARK-11439: -- User 'nakul02' has created a pull request for this issue: https://github.com/apache/spark/pull/9745 > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating sparse features in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more cost-efficient to avoid > generating dense vectors when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
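The optimization this ticket proposes — emitting sparse (index, value) pairs directly rather than materializing a dense vector and filtering it afterwards — could look like this (a hypothetical sketch, not the actual LinearDataGenerator change):

```python
import random


def generate_sparse_features(n_features, density, seed=0):
    """Hypothetical sketch: sample only the active indices and draw
    values for those, so no O(n_features) dense vector is allocated."""
    rng = random.Random(seed)
    n_active = max(1, int(n_features * density))
    # Choose which positions are nonzero, then fill just those positions
    indices = sorted(rng.sample(range(n_features), n_active))
    return [(i, rng.gauss(0.0, 1.0)) for i in indices]


feats = generate_sparse_features(100, 0.1)
```

For low densities this does O(density * n_features) work instead of O(n_features), which is the saving the ticket is after.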
[jira] [Assigned] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11439: Assignee: Apache Spark > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Assignee: Apache Spark >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
Marcelo Vanzin created SPARK-11762: -- Summary: TransportResponseHandler should consider open streams when counting outstanding requests Key: SPARK-11762 URL: https://issues.apache.org/jira/browse/SPARK-11762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Marcelo Vanzin Priority: Minor This code in TransportResponseHandler: {code} public int numOutstandingRequests() { return outstandingFetches.size() + outstandingRpcs.size(); } {code} is used to determine whether the channel is currently in use; if there's a timeout and the channel is in use, then the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
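The fix the ticket implies is to count open streams alongside block fetches and RPCs. A minimal Python sketch of the idea (the real class is Java, in Spark's network-common module; names below mirror it loosely):

```python
class TransportResponseHandler:
    """Sketch: the idle-timeout check asks this handler whether the
    channel is busy, so open streams must count as activity too."""

    def __init__(self):
        self.outstanding_fetches = {}   # stream/chunk fetch callbacks
        self.outstanding_rpcs = {}      # rpc id -> callback
        self.num_active_streams = 0     # incremented while a stream is open

    def num_outstanding_requests(self):
        # Including active streams means a timeout during a stream
        # transfer is treated like any other in-use channel, so the
        # channel gets closed instead of lingering open
        return (len(self.outstanding_fetches)
                + len(self.outstanding_rpcs)
                + self.num_active_streams)


handler = TransportResponseHandler()
handler.outstanding_rpcs[1] = "callback"
handler.num_active_streams += 1
```

The stream counter would be incremented when a stream request is sent and decremented when the stream completes or fails, keeping the count symmetric.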
[jira] [Reopened] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-11016: Assignee: (was: Liang-Chi Hsieh) https://github.com/apache/spark/pull/9243 is reverted > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > Fix For: 1.6.0 > > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007565#comment-15007565 ] Davies Liu commented on SPARK-11016: [~charles.al...@acxiom.com] Could you send your patch to github.com/apache/spark ? > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > Fix For: 1.6.0 > > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.<init>(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.<clinit>(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at 
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11756) SparkR can not output help information for SparkR:::summary correctly
[ https://issues.apache.org/jira/browse/SPARK-11756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007481#comment-15007481 ] Felix Cheung commented on SPARK-11756: -- [~yanboliang] Could you please clarify what the issue is? That it shows 'Summaries {base}', or that it says 'describe {SparkR}'? > SparkR can not output help information for SparkR:::summary correctly > - > > Key: SPARK-11756 > URL: https://issues.apache.org/jira/browse/SPARK-11756 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Reporter: Yanbo Liang > > R users often get help information for a method like this: > {code} > > ?summary > {code} > or > {code} > > help(summary) > {code} > For SparkR we should provide the help information for both the SparkR package and > the base R package (usually the stats package). > But for the "summary" method, the help information is not shown correctly. > {code} > > help(summary) > Help on topic ‘summary’ was found in the following packages: > Package Library > SparkR /Users/yanboliang/data/trunk2/spark/R/lib > base /Library/Frameworks/R.framework/Resources/library > Choose one > 1: describe {SparkR} > 2: Object Summaries {base} > {code} > It only shows the help for describe(DataFrame), which is synonymous with > summary(DataFrame); we also need the help information for > summary(PipelineModel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11390) Query plan with/without filterPushdown indistinguishable
[ https://issues.apache.org/jira/browse/SPARK-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11390. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9679 [https://github.com/apache/spark/pull/9679] > Query plan with/without filterPushdown indistinguishable > > > Key: SPARK-11390 > URL: https://issues.apache.org/jira/browse/SPARK-11390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: All >Reporter: Vishesh Garg >Priority: Minor > Fix For: 1.6.0 > > > The execution plan of a query remains the same regardless of whether the > filterPushdown flag has been set to "true" or "false", as can be seen below: > == > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > == > Ideally, when the filterPushdown flag is set to "true", both the scan and the > filter nodes should be merged together to make it clear that the filtering is > being done by the data source itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11756) SparkR can not output help information for SparkR:::summary correctly
[ https://issues.apache.org/jira/browse/SPARK-11756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-11756: - Component/s: SparkR > SparkR can not output help information for SparkR:::summary correctly > - > > Key: SPARK-11756 > URL: https://issues.apache.org/jira/browse/SPARK-11756 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Reporter: Yanbo Liang > > R users often get help information for a method like this: > {code} > > ?summary > {code} > or > {code} > > help(summary) > {code} > For SparkR we should provide the help information for both the SparkR package and > the base R package (usually the stats package). > But for the "summary" method, the help information is not shown correctly. > {code} > > help(summary) > Help on topic ‘summary’ was found in the following packages: > Package Library > SparkR /Users/yanboliang/data/trunk2/spark/R/lib > base /Library/Frameworks/R.framework/Resources/library > Choose one > 1: describe {SparkR} > 2: Object Summaries {base} > {code} > It only shows the help for describe(DataFrame), which is synonymous with > summary(DataFrame); we also need the help information for > summary(PipelineModel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007487#comment-15007487 ] Nakul Jindal commented on SPARK-11439: -- Thanks [~lewuathe]. I've also updated the comment in the LinearRegressionSuite.scala file with an R snippet to reproduce the results. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11747) Can not specify input path in python logistic_regression example under ml
[ https://issues.apache.org/jira/browse/SPARK-11747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007528#comment-15007528 ] Joseph K. Bradley commented on SPARK-11747: --- Although there are some examples which are essentially command-line scripts, most examples are really meant to be copied and modified as needed. We may need to wait on this, depending on how testable example code refactoring happens: [SPARK-11337] > Can not specify input path in python logistic_regression example under ml > - > > Key: SPARK-11747 > URL: https://issues.apache.org/jira/browse/SPARK-11747 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Jeff Zhang >Priority: Minor > > Not sure why it is hard-coded; it would be nice to allow the user to specify the > input path > {code} > # Load and parse the data file into a dataframe. > df = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
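As a sketch of the requested improvement, the example could accept the path on the command line and fall back to the current file as a default. The argument name and wiring below are illustrative, not the actual example's interface:

```python
# Sketch: read the input path from the command line instead of hard-coding it.
# The positional argument is optional; omitting it keeps today's behavior.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="logistic_regression example")
    parser.add_argument("input", nargs="?",
                        default="data/mllib/sample_libsvm_data.txt",
                        help="path to a LibSVM-format input file")
    return parser.parse_args(argv)

args = parse_args(["/tmp/my_data.txt"])
print(args.input)  # /tmp/my_data.txt

# The example's load line would then become (hypothetical):
# df = MLUtils.loadLibSVMFile(sc, args.input).toDF()
```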
[jira] [Updated] (SPARK-11390) Query plan with/without filterPushdown indistinguishable
[ https://issues.apache.org/jira/browse/SPARK-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11390: - Assignee: Zee Chen > Query plan with/without filterPushdown indistinguishable > > > Key: SPARK-11390 > URL: https://issues.apache.org/jira/browse/SPARK-11390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: All >Reporter: Vishesh Garg >Assignee: Zee Chen >Priority: Minor > Fix For: 1.6.0 > > > The execution plan of a query remains the same regardless of whether the > filterPushdown flag has been set to "true" or "false", as can be seen below: > == > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > == > Ideally, when the filterPushdown flag is set to "true", both the scan and the > filter nodes should be merged together to make it clear that the filtering is > being done by the data source itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11439: Assignee: (was: Apache Spark) > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11271) MapStatus too large for driver
[ https://issues.apache.org/jira/browse/SPARK-11271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-11271: https://github.com/apache/spark/pull/9243 is reverted > MapStatus too large for driver > -- > > Key: SPARK-11271 > URL: https://issues.apache.org/jira/browse/SPARK-11271 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Kent Yao >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > When I run a spark job that contains quite a lot of tasks (in my case > 200k[maptasks]*200k[reducetasks]), the driver hit an OOM mainly caused by > the MapStatus objects; > the RoaringBitmap used to mark which blocks are empty seems to use too much > memory. > I tried org.apache.spark.util.collection.BitSet instead of > RoaringBitmap, and it saves about 20% of the memory. > For the 200K-task job, > RoaringBitmap uses 3 Long[1024] and 1 Short[3392] > = 3*64*1024 + 16*3392 = 250880 (bits) > BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits) > Memory saved = (250880 - 200000) / 250880 ≈ 20% -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
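The savings estimate in the report can be checked with a quick back-of-the-envelope script. This is only arithmetic over the array sizes quoted above; per-object JVM overhead is ignored:

```python
# Back-of-the-envelope memory estimate for tracking 200K blocks per MapStatus,
# using the array sizes quoted in the report.
LONG_BITS = 64
SHORT_BITS = 16

# RoaringBitmap: 3 Long[1024] word arrays plus 1 Short[3392] of keys/cardinalities.
roaring_bits = 3 * 1024 * LONG_BITS + 3392 * SHORT_BITS   # 250880

# Plain BitSet: one bit per block -> ceil(200_000 / 64) = 3125 longs.
bitset_bits = 3125 * LONG_BITS                            # 200000

savings = (roaring_bits - bitset_bits) / roaring_bits
print(roaring_bits, bitset_bits, round(savings * 100))    # 250880 200000 20
```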
[jira] [Resolved] (SPARK-6328) Python API for StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-6328. -- Resolution: Fixed Fix Version/s: 1.6.0 > Python API for StreamingListener > > > Key: SPARK-6328 > URL: https://issues.apache.org/jira/browse/SPARK-6328 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Yifan Wang > Fix For: 1.6.0 > > > StreamingListener API is only available in Java/Scala. It will be useful to > make it available in Python so that Spark application written in python can > check the status of ongoing streaming computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11720: -- Component/s: ML > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Jihong MA >Priority: Minor > > change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it inline with all other univariate stats function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11720: -- Assignee: Jihong MA > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Jihong MA >Priority: Minor > > change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it inline with all other univariate stats function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11761: Assignee: Apache Spark > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu >Assignee: Apache Spark > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11761: Assignee: (was: Apache Spark) > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007215#comment-15007215 ] Apache Spark commented on SPARK-11761: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/9741 > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11754) consolidate `ExpressionEncoder.tuple` and `Encoders.tuple`
[ https://issues.apache.org/jira/browse/SPARK-11754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11754. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9729 [https://github.com/apache/spark/pull/9729] > consolidate `ExpressionEncoder.tuple` and `Encoders.tuple` > -- > > Key: SPARK-11754 > URL: https://issues.apache.org/jira/browse/SPARK-11754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11754) consolidate `ExpressionEncoder.tuple` and `Encoders.tuple`
[ https://issues.apache.org/jira/browse/SPARK-11754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11754: - Assignee: Wenchen Fan > consolidate `ExpressionEncoder.tuple` and `Encoders.tuple` > -- > > Key: SPARK-11754 > URL: https://issues.apache.org/jira/browse/SPARK-11754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Target Version/s: 1.6.0 > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Assignee: Tim Hunter > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11731) Enable batching on Driver WriteAheadLog by default
[ https://issues.apache.org/jira/browse/SPARK-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-11731. --- Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 1.6.0 > Enable batching on Driver WriteAheadLog by default > -- > > Key: SPARK-11731 > URL: https://issues.apache.org/jira/browse/SPARK-11731 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > Using batching on the driver for the WriteAheadLog should be an improvement > for all environments and use cases. Users will be able to scale to much > higher number of receivers with the BatchedWriteAheadLog. Therefore we should > turn it on by default, and QA it in the QA period. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
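The batching idea behind the driver's WriteAheadLog can be sketched without any Spark code: many receivers enqueue records, and a single writer drains whatever has accumulated into one log write, cutting the number of slow fsync-style operations. This is a simplified stand-in for the BatchedWriteAheadLog, not its actual implementation:

```python
# Sketch: batch queued records so many producer writes become one log write.
import queue

def drain_batch(q):
    """Block for one record, then grab everything else already queued."""
    batch = [q.get()]
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            return batch

q = queue.Queue()
for record in (b"a", b"b", b"c"):   # three receivers enqueue records
    q.put(record)
print(drain_batch(q))               # one batch: [b'a', b'b', b'c']
```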
[jira] [Created] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
Ted Yu created SPARK-11761: -- Summary: Prevent the call to StreamingContext#stop() in the listener bus's thread Key: SPARK-11761 URL: https://issues.apache.org/jira/browse/SPARK-11761 Project: Spark Issue Type: Bug Components: Streaming Reporter: Ted Yu Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : {code} The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext. I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally. {code} Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
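The deadlock this issue guards against can be shown with plain Python threading (a stand-in for the listener bus, not Spark code): stop() joins the dispatch thread, so a stop() issued *from* that thread would wait on itself forever.

```python
# Minimal illustration of why stop() must not run on the listener thread:
# stop() joins the dispatch thread, which can never finish while it is
# itself blocked inside stop().
import threading

class ListenerBus:
    def __init__(self):
        self.thread = threading.Thread(target=self._run, daemon=True)
        self._stopped = threading.Event()

    def _run(self):
        self._stopped.wait()  # stand-in for the event-dispatch loop

    def start(self):
        self.thread.start()

    def stop(self):
        # The proposed guard: refuse to stop from the bus's own thread.
        if threading.current_thread() is self.thread:
            raise RuntimeError("cannot call stop() from the listener thread")
        self._stopped.set()
        self.thread.join()  # would never return if called from self.thread

bus = ListenerBus()
bus.start()
bus.stop()
print("stopped cleanly")
```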
[jira] [Commented] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.
[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007220#comment-15007220 ] Daniel Jalova commented on SPARK-11319: --- Seems that this is possible in the Scala API too. > PySpark silently Accepts null values in non-nullable DataFrame fields. > -- > > Key: SPARK-11319 > URL: https://issues.apache.org/jira/browse/SPARK-11319 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Kevin Cox > > Running the following code with a null value in a non-nullable column > silently works. This makes the code incredibly hard to trust. > {code} > In [2]: from pyspark.sql.types import * > In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", > TimestampType(), False)])).collect() > Out[3]: [Row(a=None)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
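A plain-Python sketch of the validation the reporter expected createDataFrame to perform. The function and field representation here are illustrative, not the PySpark API:

```python
# Sketch: reject None in a field declared non-nullable, instead of silently
# accepting it as the reported behavior does.
def check_nullability(rows, fields):
    """fields: list of (name, nullable) pairs; rows: list of tuples."""
    for i, row in enumerate(rows):
        for (name, nullable), value in zip(fields, row):
            if value is None and not nullable:
                raise ValueError(
                    f"row {i}: field '{name}' is non-nullable but got None")

check_nullability([(1,), (2,)], [("a", False)])      # passes silently
try:
    check_nullability([(None,)], [("a", False)])     # the reported case
except ValueError as e:
    print(e)  # row 0: field 'a' is non-nullable but got None
```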
[jira] [Updated] (SPARK-6328) Python API for StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-6328: - Assignee: Yifan Wang > Python API for StreamingListener > > > Key: SPARK-6328 > URL: https://issues.apache.org/jira/browse/SPARK-6328 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Yifan Wang >Assignee: Yifan Wang > Fix For: 1.6.0 > > > StreamingListener API is only available in Java/Scala. It will be useful to > make it available in Python so that Spark application written in python can > check the status of ongoing streaming computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007272#comment-15007272 ] Apache Spark commented on SPARK-11281: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/9743 > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to to access Map field created from an environment. > Assuming local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > Problem seems to be specific to environments and cannot be reproduced when > Map comes for example from Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007315#comment-15007315 ] Marcelo Vanzin commented on SPARK-11617: BTW, I updated the PR with a test case that fails with the exceptions you saw if I disable the fix; they pass consistently with the fix applied. I also ran several jobs that do a lot of shuffles and didn't see any problems with the latest fix applied. > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
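For context on the leak warning: Netty's ByteBuf is reference-counted, and every buffer handed off must eventually be balanced by a release(), or its memory is never returned to the pool. A tiny stand-in class (illustrative only, not the Netty API) shows the contract:

```python
# Sketch of the retain()/release() contract behind the ByteBuf leak warning:
# the buffer is reclaimed only when the count reaches zero, so a missing
# release() pins the memory forever.
class RefCounted:
    def __init__(self):
        self.ref_cnt = 1  # a freshly allocated buffer starts at 1

    def retain(self):
        if self.ref_cnt == 0:
            raise RuntimeError("retain on released buffer")
        self.ref_cnt += 1
        return self

    def release(self):
        if self.ref_cnt == 0:
            raise RuntimeError("double release")
        self.ref_cnt -= 1
        return self.ref_cnt == 0  # True when the buffer is deallocated

buf = RefCounted()
buf.retain()                    # handed to another component
assert buf.release() is False   # still referenced elsewhere
assert buf.release() is True    # last reference gone; memory reclaimed
```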
[jira] [Commented] (SPARK-9065) Add the ability to specify message handler function in python similar to Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007192#comment-15007192 ]

Apache Spark commented on SPARK-9065:
-------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9742

> Add the ability to specify message handler function in python similar to Scala/Java
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-9065
>                 URL: https://issues.apache.org/jira/browse/SPARK-9065
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, Streaming
>            Reporter: Tathagata Das
>            Assignee: Saisai Shao
[jira] [Issue Comment Deleted] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node
[ https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-11633:
----------------------------
    Comment: was deleted

(was: Which version are you using? I did hit an error, but it is a different error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'F1' given input columns keyCol1, keyCol2; line 1 pos 7

This is a self-join issue. I will try to investigate the root cause. Thanks!)

> HiveContext throws TreeNode Exception : Failed to Copy Node
> -----------------------------------------------------------
>
>                 Key: SPARK-11633
>                 URL: https://issues.apache.org/jira/browse/SPARK-11633
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0, 1.5.1
>            Reporter: Saurabh Santhosh
>            Priority: Critical
>
> h2. HiveContext#sql is throwing the following exception in a specific scenario :
> h2. Exception :
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Failed to copy node.
> Is otherCopyArgs specified correctly for LogicalRDD.
> Exception message: wrong number of arguments
> ctor: public org.apache.spark.sql.execution.LogicalRDD(scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
>
> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "")));
> DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
> sparkHiveContext.registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");
> sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");
> {code}
> h2. Observations :
> * If F1 (the exact name of the field) is used instead of f1, the code works correctly.
> * If an alias is not used for F2, the code also works, irrespective of the case of F1.
> * If field F2 is not used in the final query, the code also works correctly.
[jira] [Reopened] (SPARK-9603) Re-enable complex R package test in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman reopened SPARK-9603:
------------------------------------------

We still have some failures on Jenkins, as reported in https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac

> Re-enable complex R package test in SparkSubmitSuite
> ----------------------------------------------------
>
>                 Key: SPARK-9603
>                 URL: https://issues.apache.org/jira/browse/SPARK-9603
>             Project: Spark
>          Issue Type: Test
>          Components: Deploy, SparkR, Tests
>    Affects Versions: 1.5.0
>            Reporter: Burak Yavuz
>            Assignee: Sun Rui
>             Fix For: 1.6.0
>
> For building complex Spark Packages that contain R code in addition to Scala, we have a complex procedure, where R source code is shipped inside a jar. The source code is extracted, built, and added as a library alongside SparkR. The end-to-end test in SparkSubmitSuite ("correctly builds R packages included in a jar with --packages") can't run on Jenkins now, because the pull request builder is not built with SparkR. Once the PR builder is built with SparkR, we should re-enable the test.
[jira] [Resolved] (SPARK-11742) Show batch failures in the Streaming UI landing page
[ https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-11742.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Show batch failures in the Streaming UI landing page
> ----------------------------------------------------
>
>                 Key: SPARK-11742
>                 URL: https://issues.apache.org/jira/browse/SPARK-11742
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 1.6.0
[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically
[ https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-11259:
--------------------------------------
    Target Version/s: 1.6.1, 1.7.0

> Params.validateParams() should be called automatically
> ------------------------------------------------------
>
>                 Key: SPARK-11259
>                 URL: https://issues.apache.org/jira/browse/SPARK-11259
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> Params.validateParams() is not called automatically at the moment, so the following code snippet does not throw an exception, which is not the expected behavior.
> {code}
> val df = sqlContext.createDataFrame(
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
>     (2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
>     (3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
>     (4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
> ).toDF("id", "features", "label")
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(10)
>   .setMax(0)
> val pipeline = new Pipeline().setStages(Array(scaler))
> pipeline.fit(df)
> {code}
> validateParams() should be called automatically by PipelineStage (Pipeline/Estimator/Transformer), so I propose to put it in transformSchema().
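The proposed behavior — transformSchema invoking validateParams so that an inconsistent stage fails fast at fit time — can be sketched in a few lines of plain Python. This is an illustrative sketch only; the class and method names below mimic, but are not, Spark's ML API.

```python
class PipelineStage:
    """Illustrative stand-in for Spark ML's PipelineStage."""

    def validate_params(self):
        pass  # subclasses override to check cross-parameter consistency

    def transform_schema(self, schema):
        # Proposed behavior: validation runs automatically before any
        # schema transformation, so a misconfigured stage fails at fit()
        # time instead of silently producing bad output.
        self.validate_params()
        return schema


class MinMaxScaler(PipelineStage):
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def validate_params(self):
        if self.lo >= self.hi:
            raise ValueError(f"min ({self.lo}) must be < max ({self.hi})")


# Same inconsistency as in the report: min=10, max=0.
stage = MinMaxScaler(lo=10, hi=0)
try:
    stage.transform_schema(["features"])
except ValueError as e:
    print("rejected:", e)
```

With validation hooked into transform_schema, the bad configuration is rejected before any data is touched, which is the outcome the snippet in the report currently fails to produce.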
[jira] [Resolved] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-11553.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9642
[https://github.com/apache/spark/pull/9642]

> row.getInt(i) if row[i]=null returns 0
> --------------------------------------
>
>                 Key: SPARK-11553
>                 URL: https://issues.apache.org/jira/browse/SPARK-11553
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tofigh
>            Priority: Blocker
>             Fix For: 1.6.0
>
> Row.getInt|getFloat|getDouble in a Spark RDD return 0 if row[index] is null, even though according to the documentation they should throw a NullPointerException.
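The pitfall reported here — a primitive getter silently decoding null as 0 — can be reproduced without Spark. A minimal pure-Python sketch (the `Row` class and its method names are illustrative, not Spark's actual implementation) of why callers should test for null before using a primitive getter:

```python
class Row:
    """Illustrative stand-in for a SQL row backed by possibly-null slots."""

    def __init__(self, values):
        self._values = values

    def is_null_at(self, i):
        return self._values[i] is None

    def get_int(self, i):
        # Mirrors the reported behavior: a null slot silently decodes as 0
        # instead of raising, because the primitive has no null representation.
        v = self._values[i]
        return 0 if v is None else int(v)


row = Row([None, 42])
print(row.get_int(0))  # 0 -- indistinguishable from a genuine zero
print(row.get_int(1))  # 42

# The safe pattern: test for null first, then read the primitive.
value = None if row.is_null_at(0) else row.get_int(0)
print(value)  # None
```

The null-check-then-get pattern is what callers had to use while the getter returned 0; the fix in the linked pull request makes the getter itself fail loudly instead.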
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007640#comment-15007640 ]

Charles Allen commented on SPARK-11016:
---------------------------------------

[~davies] Was in a meeting, looks like you got it :)

> Spark fails when running with a task that requires a more recent version of RoaringBitmaps
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11016
>                 URL: https://issues.apache.org/jira/browse/SPARK-11016
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Charles Allen
>             Fix For: 1.6.0
>
> The following error appears during Kryo init whenever a more recent version (>0.5.0) of RoaringBitmap is required by a job; org/roaringbitmap/RoaringArray$Element was removed in 0.5.0.
> {code}
> A needed class was not found. This could be due to an error in your runpath. Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
> at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
> at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
> at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
> at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
> at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
> at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
> at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
> at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
> at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
> at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
> at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
> at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
> at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
> at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info
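One common mitigation for this kind of bundled-dependency conflict (not part of the report above, and whether it is appropriate depends on the deployment) is to let the application's jars win over Spark's copies via Spark's experimental class-loading options; shading RoaringBitmap inside the application jar is the more robust alternative:

```properties
# spark-defaults.conf -- prefer the application's RoaringBitmap over the
# version bundled with Spark. Note: userClassPathFirst is experimental and
# can itself cause classpath issues; shading is generally safer.
spark.executor.userClassPathFirst  true
spark.driver.userClassPathFirst    true
```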
[jira] [Assigned] (SPARK-11766) JSON serialization of Vectors
[ https://issues.apache.org/jira/browse/SPARK-11766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11766:
------------------------------------
    Assignee: Xiangrui Meng  (was: Apache Spark)

> JSON serialization of Vectors
> -----------------------------
>
>                 Key: SPARK-11766
>                 URL: https://issues.apache.org/jira/browse/SPARK-11766
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> We want to support JSON serialization of vectors in order to support SPARK-11764.
[jira] [Updated] (SPARK-11742) Show batch failures in the Streaming UI landing page
[ https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-11742:
----------------------------------
    Assignee: Shixiong Zhu

> Show batch failures in the Streaming UI landing page
> ----------------------------------------------------
>
>                 Key: SPARK-11742
>                 URL: https://issues.apache.org/jira/browse/SPARK-11742
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
[jira] [Commented] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007604#comment-15007604 ]

Apache Spark commented on SPARK-11762:
--------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9747

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
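The direction of the fix described in the report — include active streams in the in-use count — can be sketched in plain Python. Names here are illustrative; the real class is Java's TransportResponseHandler in Spark's network library.

```python
class ResponseHandler:
    """Illustrative stand-in for TransportResponseHandler's bookkeeping."""

    def __init__(self):
        self.outstanding_fetches = {}  # stream_chunk_id -> callback
        self.outstanding_rpcs = {}     # request_id -> callback
        self.num_active_streams = 0    # incremented when a stream is opened

    def num_outstanding_requests(self):
        # Counting only fetches and RPCs (the reported bug) would let the
        # idle-timeout handler see 0 and close a channel mid-stream.
        # Including open streams keeps the channel marked as in use until
        # the stream transfer completes.
        return (len(self.outstanding_fetches)
                + len(self.outstanding_rpcs)
                + self.num_active_streams)


h = ResponseHandler()
h.num_active_streams += 1            # a stream transfer is in progress
print(h.num_outstanding_requests())  # 1 -> channel is considered in use
```

With the stream counted, a timeout firing during the transfer no longer mistakes the channel for idle, which is exactly the failure mode the report describes.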
[jira] [Assigned] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11762:
------------------------------------
    Assignee:     (was: Apache Spark)

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
[jira] [Assigned] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11762:
------------------------------------
    Assignee: Apache Spark

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Assignee: Apache Spark
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
[jira] [Updated] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-11725:
-------------------------------------
    Target Version/s: 1.6.0
            Priority: Blocker  (was: Major)
          Issue Type: Bug  (was: Improvement)

> Let UDF to handle null value
> ----------------------------
>
>                 Key: SPARK-11725
>                 URL: https://issues.apache.org/jira/browse/SPARK-11725
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Jeff Zhang
>            Priority: Blocker
>
> I notice that Spark currently treats a long field as -1 if it is null. Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> // Output
> +----+-------+----+
> | age|   name|age2|
> +----+-------+----+
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if the UDF input has a null value argument
> * Let the UDF itself handle null
> I would prefer the third option
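The three options listed in the report can be contrasted with ordinary Python functions. This is a sketch of the semantics only, not PySpark's actual UDF machinery; the wrapper names are invented for illustration.

```python
def f(x):
    """The user's UDF body: x + 1, written assuming a non-null input."""
    return x + 1

# Option 1: sentinel -- null is silently coerced to -1 before the UDF runs
# (the behavior described in the report, which turns null into 0 after +1).
def udf_sentinel(x):
    return f(-1 if x is None else x)

# Option 2: propagate -- null in, null out; the UDF body never sees null.
def udf_propagate(x):
    return None if x is None else f(x)

# Option 3: let the UDF handle null itself (the reporter's preference).
# The fallback value here is the UDF author's explicit, visible choice.
def udf_aware(x):
    return 0 if x is None else f(x)

ages = [None, 30, 19]
print([udf_sentinel(a) for a in ages])   # [0, 31, 20] -- null silently became 0
print([udf_propagate(a) for a in ages])  # [None, 31, 20]
print([udf_aware(a) for a in ages])      # [0, 31, 20] -- but 0 was chosen on purpose
```

Options 1 and 3 can produce the same numbers, but only option 3 makes the null handling an explicit decision in the UDF rather than a hidden coercion, which is why the report argues for it.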
[jira] [Created] (SPARK-11763) Refactoring to create template for Estimator, Model pairs
Joseph K. Bradley created SPARK-11763:
-------------------------------------

             Summary: Refactoring to create template for Estimator, Model pairs
                 Key: SPARK-11763
                 URL: https://issues.apache.org/jira/browse/SPARK-11763
             Project: Spark
          Issue Type: Sub-task
          Components: ML
            Reporter: Joseph K. Bradley
            Assignee: Joseph K. Bradley

Add save/load to the LogisticRegression Estimator, and refactor the tests a little to make it easier to add similar support to other Estimator, Model pairs.