[jira] [Created] (SPARK-11758) Missing Index column while creating a DataFrame from Pandas
Leandro Ferrado created SPARK-11758: --- Summary: Missing Index column while creating a DataFrame from Pandas Key: SPARK-11758 URL: https://issues.apache.org/jira/browse/SPARK-11758 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.1 Environment: Linux Debian, PySpark, in local testing. Reporter: Leandro Ferrado Priority: Minor

In PySpark's SQLContext, when createDataFrame() is invoked with a pandas.DataFrame and a 'schema' of StructFields, the helper _createFromLocal() converts the pandas.DataFrame but ignores two things:
- the index column, because it calls to_records() with the flag index=False
- timestamp records, because a date column can't be the index here and pandas doesn't convert its records to Timestamp type

So converting a DataFrame from pandas to Spark SQL handles scenarios with temporal records poorly. Doc: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html

Affected code:
{code}
def _createFromLocal(self, data, schema):
    """
    Create an RDD for DataFrame from a list or pandas.DataFrame, returns
    the RDD and schema.
    """
    if has_pandas and isinstance(data, pandas.DataFrame):
        if schema is None:
            schema = [str(x) for x in data.columns]
        data = [r.tolist() for r in data.to_records(index=False)]  # HERE
        # ...
{code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
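For illustration, the index loss can be reproduced without Spark at all. A hedged sketch (the frame contents here are made up) of what to_records(index=False) drops:

```python
import pandas as pd

# A frame whose only temporal information lives in the index, as described above.
pdf = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.to_datetime(["2015-11-01", "2015-11-02"]),
)

# What _createFromLocal() does today: index=False silently drops the index column.
without_index = [r.tolist() for r in pdf.to_records(index=False)]

# index=True would keep the timestamps as the first field of every record.
with_index = [r.tolist() for r in pdf.to_records(index=True)]

print(len(without_index[0]), len(with_index[0]))  # 1 vs 2 fields per row
```

So any fix presumably has to either pass index=True or reset the index into a regular column before conversion; which of the two Spark should choose is beyond this sketch.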
[jira] [Created] (SPARK-11759) Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found
Luis Alves created SPARK-11759: -- Summary: Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found Key: SPARK-11759 URL: https://issues.apache.org/jira/browse/SPARK-11759 Project: Spark Issue Type: Question Reporter: Luis Alves

I'm using Spark 1.5.1 and Mesos 0.25 in cluster mode. I have the spark dispatcher running, and I run spark-submit. The driver is launched, but it fails because the task it launches fails. In the logs of the launched task I can see the following error:

sh: 1: /opt/spark/bin/spark-class: not found

I checked my docker image and /opt/spark/bin/spark-class exists. I then noticed that it is being run with sh, so I tried to run the following in the docker image:

sh /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master

It fails with the following error:

spark-class: 73: spark-class: Syntax error: "(" unexpected

Is this an error in Spark? Thanks
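For what it's worth, the '"(" unexpected' message is what a POSIX-only shell (dash, which is /bin/sh on Debian-based images) prints for bash array syntax, which spark-class relies on. A hedged sketch, using a stand-in script rather than spark-class itself, and assuming bash (and optionally dash) are on the PATH:

```python
import shutil
import subprocess
import tempfile
import textwrap

# A stand-in for the bash array syntax spark-class uses around line 73.
script = textwrap.dedent("""\
    CMD=("$@")
    echo "${CMD[0]}"
""")
with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
    f.write(script)
    path = f.name

# bash accepts the array syntax and prints the first argument...
bash_out = subprocess.run(["bash", path, "hello"],
                          capture_output=True, text=True)
print(bash_out.stdout.strip())

# ...while dash rejects the "(" with a syntax error, as in the report.
if shutil.which("dash"):
    dash_out = subprocess.run(["dash", path, "hello"],
                              capture_output=True, text=True)
    print(dash_out.stderr.strip())
```

If that is what is happening here, launching spark-class with bash (or using an image whose /bin/sh is bash) should avoid the error; that Mesos is invoking it via sh is my reading of the log line, not something I've verified.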
[jira] [Commented] (SPARK-11202) Unsupported dataType
[ https://issues.apache.org/jira/browse/SPARK-11202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006897#comment-15006897 ] F Jimenez commented on SPARK-11202: --- I have noticed the following commit that may have solved the problem https://github.com/apache/spark/commit/02149ff08eed3745086589a047adbce9a580389f > Unsupported dataType > > > Key: SPARK-11202 > URL: https://issues.apache.org/jira/browse/SPARK-11202 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: whc > > I read data from oracle and save as parquet ,then get the following error: > java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]} > ^ > at > org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(DataType.scala:245) > at > org.apache.spark.sql.types.DataType$.fromCaseClassString(DataType.scala:102) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromString(ParquetTypesConverter.scala:62) > at > org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:51) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288) > at > 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > I checked the types, and there is no Timestamp or Date type in Oracle. > My Oracle table looks like this: > create table DW_DOMAIN > ( > domain_id NUMBER, > cityid NUMBER, > domain_type NUMBER, > domain_name VARCHAR2(80) > ) > and my code looks like this: > Map<String, String> options = new HashMap<String, String>(); > options.put("url", url); > options.put("driver", driver); > options.put("user", user); > options.put("password", password); > options.put("dbtable", "(select DOMAIN_NAME,DOMAIN_ID from > dw_domain ) t"); > DataFrame df = this.sqlContext.read().format("jdbc").options(options) > .load(); > df.write().mode(SaveMode.Append) > > .parquet("hdfs://cluster1:8020/database/count_domain/"); > If I add "to_char(DOMAIN_ID)", I get the correct result. 
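A hedged look at why that schema string is rejected. The precision/scale rule below reflects Spark's DecimalType constraints as I understand them, and the Oracle JDBC driver reportedly describes an unconstrained NUMBER column as precision 0, scale -127, which is how "decimal(0,-127)" ends up in the schema:

```python
import json
import re

# The schema string from the error above, with metadata trimmed for brevity.
schema = json.loads(
    '{"type":"struct","fields":['
    '{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{}},'
    '{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{}}]}'
)

valid = {}
for field in schema["fields"]:
    m = re.match(r"decimal\((-?\d+),(-?\d+)\)$", field["type"])
    if m:
        precision, scale = int(m.group(1)), int(m.group(2))
        # A usable decimal needs 1 <= precision (<= 38 in Spark) and
        # 0 <= scale <= precision; (0, -127) fails on every count.
        valid[field["name"]] = 1 <= precision <= 38 and 0 <= scale <= precision

print(valid)  # {'DOMAIN_ID': False}
```

Casting in the query, as the reporter notes with to_char(DOMAIN_ID), sidesteps the problem by shipping the column to Spark as a string.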
[jira] [Updated] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-11752: --- Fix Version/s: (was: 1.5.2) 1.5.3 > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.5.3, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11665) Support other distance metrics for bisecting k-means
[ https://issues.apache.org/jira/browse/SPARK-11665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006793#comment-15006793 ] Jun Zheng commented on SPARK-11665: --- If no one else is interested, can you assign it to me? > Support other distance metrics for bisecting k-means > > > Key: SPARK-11665 > URL: https://issues.apache.org/jira/browse/SPARK-11665 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > > Some people have requested support for other distance metrics, such as cosine > distance and Tanimoto distance, in bisecting k-means. > We should > - design the interfaces for distance metrics > - support the distances
[jira] [Resolved] (SPARK-11572) Exit AsynchronousListenerBus thread when stop() is called
[ https://issues.apache.org/jira/browse/SPARK-11572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved SPARK-11572. Resolution: Won't Fix > Exit AsynchronousListenerBus thread when stop() is called > - > > Key: SPARK-11572 > URL: https://issues.apache.org/jira/browse/SPARK-11572 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Ted Yu > > As vonnagy reported in the following thread: > http://search-hadoop.com/m/q3RTtk982kvIow22 > Attempts to join the thread in AsynchronousListenerBus resulted in a lock-up > because the AsynchronousListenerBus thread was still receiving > SparkListenerExecutorMetricsUpdate messages from the DAGScheduler. > The proposed fix is to check the stopped flag within the loop of the > AsynchronousListenerBus thread.
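The proposed fix amounts to a consumer loop that re-checks a stopped flag on every iteration; here is a hedged, Spark-free Python sketch of that shape (class and field names are invented, not Spark's):

```python
import queue
import threading

class ListenerBus:
    """Toy event bus: one consumer thread draining a queue until stopped."""

    def __init__(self):
        self.events = queue.Queue()
        self.stopped = threading.Event()
        self.processed = []
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        # Re-checking the flag inside the loop is the point of the fix:
        # the thread exits promptly even if producers keep posting events.
        while not self.stopped.is_set():
            try:
                event = self.events.get(timeout=0.05)
            except queue.Empty:
                continue
            self.processed.append(event)

    def post(self, event):
        self.events.put(event)

    def stop(self):
        self.stopped.set()
        self.thread.join(timeout=2)

bus = ListenerBus()
bus.post("metrics-update")
bus.stop()
print(bus.thread.is_alive())  # False: the thread exited despite ongoing traffic
```

The real bus also has to decide what to do with events still queued at stop time; the sketch only shows why checking the flag inside the loop lets join() return.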
[jira] [Resolved] (SPARK-11522) input_file_name() returns "" for external tables
[ https://issues.apache.org/jira/browse/SPARK-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11522. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9542 [https://github.com/apache/spark/pull/9542] > input_file_name() returns "" for external tables > > > Key: SPARK-11522 > URL: https://issues.apache.org/jira/browse/SPARK-11522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Simeon Simeonov > Labels: external-tables, hive, sql > Fix For: 1.6.0 > > > Given an external table definition where the data consists of many CSV files, > {{input_file_name()}} returns empty strings. > Table definition: > {code} > CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = ",", >"quoteChar" = "\"", >"escapeChar"= "\\" > ) > LOCATION 'file:///Users/sim/spark/test/external_test' > {code} > Query: > {code} > sql("SELECT input_file_name() as file FROM external_test").show > {code} > Output: > {code} > ++ > |file| > ++ > || > || > ... > || > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11743) Add UserDefinedType support to RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11743. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9712 [https://github.com/apache/spark/pull/9712] > Add UserDefinedType support to RowEncoder > - > > Key: SPARK-11743 > URL: https://issues.apache.org/jira/browse/SPARK-11743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 1.6.0 > > > RowEncoder doesn't support UserDefinedType now. We should add the support for > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10935) Avito Context Ad Clicks
[ https://issues.apache.org/jira/browse/SPARK-10935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006824#comment-15006824 ] Kristina Plazonic commented on SPARK-10935: --- [~xusen] Thanks for pinging. Yes, I think I resolved it - I disabled Tungsten and that made the error go away (in a smaller case). However, I think I'm being super inefficient when generating the features for this problem, because of all the joins. Do you have any pointers on that? [~mengxr], I think it would really help data scientists to have a small document - a guide to feature assembly in Spark - what to do and what not to do when using joins, especially when using ML, i.e. DataFrames. I spent an inordinate amount of time on that, and I'm still confused! For example, should I use DataFrames at all when doing joins? Is it better to use RDDs, because you can partition RDDs by keys but not DataFrames (e.g. in this example every join is by UserID, and you have 4 million users, so if you had partitioned the dataframes by UserID, every join would be local)? Another example: when I started seeing the memory errors with joins, I started asking myself whether a whole DataFrame passed into a function is included in the function's closure, with a copy shipped off with every task, or whether Spark takes into account that the argument is a distributed object and passes only a reference to each partition. I still don't really know for sure. All examples on the Spark website and docs and even books are for scripts, not functions with RDD or DataFrame arguments. Thanks for any insights... 
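On the partitioning question: the intuition can be sketched without Spark (names and data below are invented). If both sides of a join are hash-partitioned by the join key into the same number of partitions, every matching pair of rows lands in the same partition, so the join needs no cross-partition traffic:

```python
def hash_partition(rows, key, n):
    """Assign each row to one of n partitions by hashing its join key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

users  = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
clicks = [{"id": 1, "ad": "x"}, {"id": 2, "ad": "y"}, {"id": 1, "ad": "z"}]

n = 4
joined = []
# Both sides partitioned the same way: matching ids share a partition index,
# so each partition can be joined locally.
for up, cp in zip(hash_partition(users, "id", n), hash_partition(clicks, "id", n)):
    for u in up:
        for c in cp:
            if u["id"] == c["id"]:
                joined.append((u["name"], c["ad"]))

print(sorted(joined))
```

In the RDD API this is roughly what partitionBy on a pair RDD buys you; DataFrames in 1.5 don't expose an equivalent knob, which is presumably why the DataFrame joins here shuffle every time.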
> Avito Context Ad Clicks > --- > > Key: SPARK-10935 > URL: https://issues.apache.org/jira/browse/SPARK-10935 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng > > From [~kpl...@gmail.com]: > I would love to do Avito Context Ad Clicks - > https://www.kaggle.com/c/avito-context-ad-clicks - but it involves a lot of > feature engineering and preprocessing. I would love to split this with > somebody else if anybody is interested on working with this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11752. Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9728 [https://github.com/apache/spark/pull/9728] > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0, 1.5.2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
[ https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006828#comment-15006828 ] Pedro Vilaça commented on SPARK-8332: - We're facing the same problem with a Spark Streaming job, and I noticed that this issue was closed. Is there a plan to upgrade the Jackson version that is being used? > NoSuchMethodError: > com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer > -- > > Key: SPARK-8332 > URL: https://issues.apache.org/jira/browse/SPARK-8332 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 > Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0 >Reporter: Tao Li >Priority: Critical > Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson > > I compiled the new Spark 1.4.0 version. > But when I run a simple WordCount demo, it throws a NoSuchMethodError: > {code} > java.lang.NoSuchMethodError: > com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer > {code} > I found out that the default "fasterxml.jackson.version" is 2.4.4. > Is there anything wrong, or a conflict with the Jackson version? > Or does some project Maven dependency possibly contain the wrong > version of Jackson?
[jira] [Updated] (SPARK-11522) input_file_name() returns "" for external tables
[ https://issues.apache.org/jira/browse/SPARK-11522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11522: - Assignee: Xin Wu > input_file_name() returns "" for external tables > > > Key: SPARK-11522 > URL: https://issues.apache.org/jira/browse/SPARK-11522 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Simeon Simeonov >Assignee: Xin Wu > Labels: external-tables, hive, sql > Fix For: 1.6.0 > > > Given an external table definition where the data consists of many CSV files, > {{input_file_name()}} returns empty strings. > Table definition: > {code} > CREATE EXTERNAL TABLE external_test(page_id INT, impressions INT) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' > WITH SERDEPROPERTIES ( >"separatorChar" = ",", >"quoteChar" = "\"", >"escapeChar"= "\\" > ) > LOCATION 'file:///Users/sim/spark/test/external_test' > {code} > Query: > {code} > sql("SELECT input_file_name() as file FROM external_test").show > {code} > Output: > {code} > ++ > |file| > ++ > || > || > ... > || > ++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11700) Memory leak at SparkContext jobProgressListener stageIdToData map
[ https://issues.apache.org/jira/browse/SPARK-11700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006895#comment-15006895 ] Kostas papageorgopoulos commented on SPARK-11700: - One workaround that minimizes the effect is to keep the JavaSparkContext alive forever (never stop it inside a long-running JVM process) and to set the following options to very small numbers, so that the relevant {{JobProgressListener}} maps get cleaned:
{code}
spark.ui.retainedJobs 1000                 How many jobs the Spark UI and status APIs remember before garbage collecting.
spark.ui.retainedStages 1000               How many stages the Spark UI and status APIs remember before garbage collecting.
spark.worker.ui.retainedExecutors 1000     How many finished executors the Spark UI and status APIs remember before garbage collecting.
spark.worker.ui.retainedDrivers 1000       How many finished drivers the Spark UI and status APIs remember before garbage collecting.
spark.sql.ui.retainedExecutions 1000       How many finished executions the Spark UI and status APIs remember before garbage collecting.
spark.streaming.ui.retainedBatches 1000    How many finished batches the Spark UI and status APIs remember before garbage collecting.
{code}
> Memory leak at SparkContext jobProgressListener stageIdToData map > - > > Key: SPARK-11700 > URL: https://issues.apache.org/jira/browse/SPARK-11700 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Ubuntu 14.04 LTS, Oracle JDK 1.8.51 Apache tomcat > 8.0.28. Spring 4 >Reporter: Kostas papageorgopoulos >Priority: Minor > Labels: leak, memory-leak > Attachments: AbstractSparkJobRunner.java, > SparkContextPossibleMemoryLeakIDEA_DEBUG.png, SparkHeapSpaceProgress.png, > SparkMemoryAfterLotsOfConsecutiveRuns.png, > SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png > > > It seems that there is a SparkContext jobProgressListener memory leak. > Below I describe the steps I take to reproduce it. 
> I have created a Java webapp that abstractly runs some Spark SQL jobs that
> read data from HDFS (joining them) and write them to ElasticSearch using the
> ES-Hadoop connector. After a lot of consecutive runs I noticed that my heap
> space was full, so I got an out-of-heap-space error.
> In the attached file {{AbstractSparkJobRunner}}, the method {{public final
> void run(T jobConfiguration, ExecutionLog executionLog) throws Exception}}
> runs each time a Spark SQL job is triggered, so I tried to reuse the same
> SparkContext for a number of consecutive runs. If certain rules apply, I try
> to clean up the SparkContext by first calling {{killSparkAndSqlContext}}.
> This code eventually runs:
> {code}
> synchronized (sparkContextThreadLock) {
>     if (javaSparkContext != null) {
>         LOGGER.info("!!! CLEARING SPARK CONTEXT!!!");
>         javaSparkContext.stop();
>         javaSparkContext = null;
>         sqlContext = null;
>         System.gc();
>     }
>     numberOfRunningJobsForSparkContext.getAndSet(0);
> }
> {code}
> So at some point in time, if no other Spark SQL job should run, I kill the
> SparkContext ({{AbstractSparkJobRunner.killSparkAndSqlContext}} runs) and it
> should be garbage collected. However, this is not the case: even though my
> debugger shows that my JavaSparkContext object is null (see the attached
> picture SparkContextPossibleMemoryLeakIDEA_DEBUG.png), jvisualvm shows heap
> usage growing even when the garbage collector is called (see the attached
> picture SparkHeapSpaceProgress.png).
> The Memory Analyzer Tool shows that a big part of the retained heap is
> assigned to _jobProgressListener (see the attached picture
> SparkMemoryAfterLotsOfConsecutiveRuns.png and the summary picture
> SparkMemoryLeakAfterLotsOfRunsWithinTheSameContext.png), although at the
> same time the JavaSparkContext in the singleton service is null. 
[jira] [Created] (SPARK-11760) SQL Catalyst data time test fails
Jean-Baptiste Onofré created SPARK-11760: Summary: SQL Catalyst data time test fails Key: SPARK-11760 URL: https://issues.apache.org/jira/browse/SPARK-11760 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Jean-Baptiste Onofré

In the sql/catalyst module, test("hours / minute / seconds") fails on the third test datum:
{code}
- hours / miniute / seconds *** FAILED ***
29 did not equal 50 (DateTimeUtilsSuite.scala:370)
{code}
The problem is that it doesn't use the timezone for the seconds, so we may end up comparing two different timestamps. I will submit a PR to fix that in DateTimeUtils.
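The 29-vs-50 failure is consistent with a seconds-level zone offset being applied on one side of the comparison but not the other. A hedged Python sketch (the 21-second offset below is invented to make the arithmetic visible, not taken from the actual failing zone; some historical zones do have non-whole-minute offsets):

```python
from datetime import datetime, timezone, timedelta

# An absolute instant whose seconds field is 29 in UTC.
ts = datetime(2015, 11, 16, 12, 0, 29, tzinfo=timezone.utc)
utc_seconds = ts.second

# Converting through a zone whose offset is not a whole number of minutes
# shifts the seconds field itself: 29 + 21 = 50.
odd_zone = timezone(timedelta(hours=5, minutes=30, seconds=21))
local_seconds = ts.astimezone(odd_zone).second

print(utc_seconds, local_seconds)  # 29 50
```

So a getSeconds that applies the zone on only one side of the comparison can report "29 did not equal 50" even though both values name the same instant.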
[jira] [Resolved] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11044. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9060 [https://github.com/apache/spark/pull/9060] > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.7.0 > > > Spark only writes the parquet files with writer version1, ignoring the given > configuration. > It should let users choose the writer version (keeping version1 as the > default).
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006672#comment-15006672 ] Apache Spark commented on SPARK-11191: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9737 > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.5.3, 1.6.0 > > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thrifeserver}} > If i run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar"}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Updated] (SPARK-11757) Incorrect join output for joining two dataframes loaded from Parquet format
[ https://issues.apache.org/jira/browse/SPARK-11757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Petri Kärkäs updated SPARK-11757: - Description: Reading in dataframes from Parquet format in s3, and executing a join between them fails when evoked by column name. Works correctly if a join condition is used instead: {code:none} sqlContext = SQLContext(sc) a = sqlContext.read.parquet('s3://path-to-data-a/') b = sqlContext.read.parquet('s3://path-to-data-b/') # result 0 rows c = a.join(b, on='id', how='left_outer') c.count() # correct output d = a.join(b, a['id']==b['id'], how='left_outer') d.count() {code} was: Reading in dataframes from Parquet format in s3, and executing a join between them fails when evoked by column name. Works correctly if a join condition is used instead: sqlContext = SQLContext(sc) a = sqlContext.read.parquet('s3://path-to-data-a/') b = sqlContext.read.parquet('s3://path-to-data-b/') # results 0 rows c = a.join(b, on='id', how='left_outer') c.count() # correct result d = a.join(b, a['id']==b['id'], how='left_outer') d.count() > Incorrect join output for joining two dataframes loaded from Parquet format > --- > > Key: SPARK-11757 > URL: https://issues.apache.org/jira/browse/SPARK-11757 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.5.0 > Environment: Python 2.7, Spark 1.5.0, Amazon linux ami > https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/ >Reporter: Petri Kärkäs > Labels: dataframe, emr, join, pyspark > > Reading in dataframes from Parquet format in s3, and executing a join between > them fails when evoked by column name. 
Works correctly if a join condition is > used instead: > {code:none} > sqlContext = SQLContext(sc) > a = sqlContext.read.parquet('s3://path-to-data-a/') > b = sqlContext.read.parquet('s3://path-to-data-b/') > # result 0 rows > c = a.join(b, on='id', how='left_outer') > c.count() > # correct output > d = a.join(b, a['id']==b['id'], how='left_outer') > d.count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006654#comment-15006654 ] Apache Spark commented on SPARK-11530: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/9736 > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
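On the substance of the request: once the eigenvalues are exposed, the proportion of variance explained is a one-liner. A library-free sketch on a made-up 2x2 covariance matrix, where the eigenvalues have a closed form so no linear-algebra package is needed:

```python
import math

# Made-up symmetric covariance matrix cov = [[a, b], [b, d]].
a, b, d = 100.0, 12.0, 9.0

# Closed-form eigenvalues of a 2x2 symmetric matrix from trace/determinant.
tr, det = a + d, a * d - b * b
lam1 = tr / 2 + math.sqrt(tr * tr / 4 - det)   # larger eigenvalue
lam2 = tr / 2 - math.sqrt(tr * tr / 4 - det)   # smaller eigenvalue

# Proportion of variance explained by the top principal component.
explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # 0.932
```

For k components of a larger matrix the same ratio is sum of the top-k eigenvalues over the eigenvalue sum, which is exactly why the issue asks for the eigenvalues to be returned alongside the principal components.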
[jira] [Assigned] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11530: Assignee: Apache Spark > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis >Assignee: Apache Spark > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11530) Return eigenvalues with PCA model
[ https://issues.apache.org/jira/browse/SPARK-11530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11530: Assignee: (was: Apache Spark) > Return eigenvalues with PCA model > - > > Key: SPARK-11530 > URL: https://issues.apache.org/jira/browse/SPARK-11530 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.5.1 >Reporter: Christos Iraklis Tsatsoulis > > For data scientists & statisticians, PCA is of little use if they cannot > estimate the _proportion of variance explained_ by selecting _k_ principal > components (see here for the math details: > https://inst.eecs.berkeley.edu/~ee127a/book/login/l_sym_pca.html , section > 'Explained variance'). To estimate this, one only needs the eigenvalues of > the covariance matrix. > Although the eigenvalues are currently computed during PCA model fitting, > they are not _returned_; hence, as it stands now, PCA in Spark ML is of > extremely limited practical use. > For details, see these SO questions > http://stackoverflow.com/questions/33428589/pyspark-and-pca-how-can-i-extract-the-eigenvectors-of-this-pca-how-can-i-calcu/ > (pyspark) > http://stackoverflow.com/questions/33559599/spark-pca-top-components (Scala) > and this blog post http://www.nodalpoint.com/pca-in-spark-1-5/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11757) Incorrect join output for joining two dataframes loaded from Parquet format
Petri Kärkäs created SPARK-11757: Summary: Incorrect join output for joining two dataframes loaded from Parquet format Key: SPARK-11757 URL: https://issues.apache.org/jira/browse/SPARK-11757 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.5.0 Environment: Python 2.7, Spark 1.5.0, Amazon linux ami https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/ Reporter: Petri Kärkäs Reading in dataframes from Parquet format in S3 and executing a join between them fails when the join is invoked by column name. Works correctly if a join condition is used instead:
{code}
sqlContext = SQLContext(sc)
a = sqlContext.read.parquet('s3://path-to-data-a/')
b = sqlContext.read.parquet('s3://path-to-data-b/')
# results in 0 rows
c = a.join(b, on='id', how='left_outer')
c.count()
# correct result
d = a.join(b, a['id'] == b['id'], how='left_outer')
d.count()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
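For reference, both invocations above express the same left outer join, so their counts should agree; in particular a left outer join can never return fewer rows than its left input. A minimal pure-Python model of the expected semantics, with toy rows standing in for the Parquet data (not Spark code):

```python
# Toy stand-ins for the two Parquet inputs (hypothetical data).
a = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
b = [{"id": 1, "y": "b1"}]

def left_outer_join(left, right, key):
    """Reference semantics: every left row survives, matched or not."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append(dict(l))  # unmatched left row; right-side columns absent/null
    return out

rows = left_outer_join(a, b, "id")
print(len(rows))  # 2 -- never fewer rows than the left input
```

Against this baseline, a count of 0 from `a.join(b, on='id', how='left_outer')` is clearly wrong regardless of the data.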
[jira] [Resolved] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-11692. Resolution: Fixed Fix Version/s: 1.7.0 Issue resolved by pull request 9658 [https://github.com/apache/spark/pull/9658] > Support for Parquet logical types, JSON and BSON (embedded types) > -- > > Key: SPARK-11692 > URL: https://issues.apache.org/jira/browse/SPARK-11692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 1.7.0 > > > Add support for the Parquet logical types JSON and BSON: JSON is represented > as UTF-8, and BSON as binary. > {code} > org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (BSON); > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:118) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:177) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:100) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:82) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Assignee: Hyukjin Kwon > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.7.0 > > > Spark only writes Parquet files with writer version1, ignoring the given > configuration. > It should let users choose the writer version (keeping version1 as the > default). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11692) Support for Parquet logical types, JSON and BSON (embedded types)
[ https://issues.apache.org/jira/browse/SPARK-11692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11692: --- Assignee: Hyukjin Kwon > Support for Parquet logical types, JSON and BSON (embedded types) > -- > > Key: SPARK-11692 > URL: https://issues.apache.org/jira/browse/SPARK-11692 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > > Add support for the Parquet logical types JSON and BSON: JSON is represented > as UTF-8, and BSON as binary. > {code} > org.apache.spark.sql.AnalysisException: Illegal Parquet type: BINARY (BSON); > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:118) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:177) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:100) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:82) > at > org.apache.spark.sql.execution.datasources.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:76) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10181) HiveContext is not used with keytab principal but with user principal/unix username
[ https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007071#comment-15007071 ] Yin Huai commented on SPARK-10181: -- [~bolke] I have merged it into branch-1.5. It will be released with 1.5.3. > HiveContext is not used with keytab principal but with user principal/unix > username > --- > > Key: SPARK-10181 > URL: https://issues.apache.org/jira/browse/SPARK-10181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: kerberos >Reporter: Bolke de Bruin >Assignee: Yu Gao > Labels: hive, hivecontext, kerberos > Fix For: 1.5.3, 1.6.0 > > > `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G > --driver-java-options -XX:MaxPermSize=4G --driver-class-path > lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml > --files conf/hive-site.xml --master yarn --principal sparkjob --keytab > /etc/security/keytabs/sparkjob.keytab --conf > spark.yarn.executor.memoryOverhead=18000 --conf > "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf > spark.eventLog.enabled=false ~/test.py` > With: > #!/usr/bin/python > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext() > sqlContext = HiveContext(sc) > query = """ SELECT * FROM fm.sk_cluster """ > rdd = sqlContext.sql(query) > rdd.registerTempTable("test") > sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * > FROM test") > Ends up with: > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denied: user=ua80tl, access=READ_EXECUTE, > inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_7805739002405787834-1/-ext-1":sparkjob:hdfs:drwxr-x--- > (Our umask denies read access to others by default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11760: Assignee: (was: Apache Spark) > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11760: Assignee: Apache Spark > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré >Assignee: Apache Spark > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006932#comment-15006932 ] Apache Spark commented on SPARK-11760: -- User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9738 > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11089: Assignee: Cheng Lian (was: Apache Spark) > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Cheng Lian > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11089: Assignee: Apache Spark (was: Cheng Lian) > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11089) Add an option for thrift-server to share a single session across all connections
[ https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007010#comment-15007010 ] Apache Spark commented on SPARK-11089: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/9740 > Add an option for thrift-server to share a single session across all > connections > --- > > Key: SPARK-11089 > URL: https://issues.apache.org/jira/browse/SPARK-11089 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Cheng Lian > > In 1.6, we improved the session support in the JDBC server by separating temporary > tables and UDFs. In some cases, users may still want to share the temporary > tables or UDFs across different applications. > We should have an option or config to support that (use the original > SQLContext instead of calling newSession if it's set to true). > cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007038#comment-15007038 ] Maciej Szymkiewicz commented on SPARK-11281: [~shivaram] I've tested both current master and my PR for [SPARK-11086] and it looks like it is indeed resolved. I would like to add some tests, but otherwise it looks like it can be closed. > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11760) SQL Catalyst date time test fails
[ https://issues.apache.org/jira/browse/SPARK-11760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jean-Baptiste Onofré resolved SPARK-11760. -- Resolution: Invalid It has already been fixed by: {code} commit 06f1fdba6d1425afddfc1d45a20dbe9bede15e7a Author: Wenchen Fan Date: Mon Nov 16 08:58:40 2015 -0800 [SPARK-11752] [SQL] fix timezone problem for DateTimeUtils.getSeconds code snippet to reproduce it: ``` TimeZone.setDefault(TimeZone.getTimeZone("Asia/Shanghai")) val t = Timestamp.valueOf("1900-06-11 12:14:50.789") val us = fromJavaTimestamp(t) assert(getSeconds(us) === t.getSeconds) ``` it will be good to add a regression test for it, but the reproducing code need to change the default timezone, and even we change it back, the `lazy val defaultTimeZone` in `DataTimeUtils` is fixed. Author: Wenchen Fan Closes #9728 from cloud-fan/seconds. {code} > SQL Catalyst date time test fails > - > > Key: SPARK-11760 > URL: https://issues.apache.org/jira/browse/SPARK-11760 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In the sql/catalyst module, test("hours / minute / seconds") fails on the > third test case: > {code} > - hours / miniute / seconds *** FAILED *** > 29 did not equal 50 (DateTimeUtilsSuite.scala:370) > {code} > Actually, the problem is that it doesn't use the timezone for seconds, so we > may get two different timestamp comparisons. > I will submit a PR to fix that in DateTimeUtils. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
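The timezone subtlety behind the fix quoted above: a local seconds-of-minute field depends on the UTC offset, and historical offsets are not always whole minutes (Asia/Shanghai's pre-1901 local mean time was roughly +8:05:43; the exact value here is assumed for illustration). Ignoring the offset therefore shifts the seconds field, which is why `getSeconds` disagreed with `Timestamp.getSeconds`. A pure-Python sketch:

```python
# Assumed pre-1901 Shanghai offset: +8h 5m 43s (not a whole number of minutes).
OFFSET_SECONDS = 8 * 3600 + 5 * 60 + 43

def seconds_field(utc_seconds_of_day, offset=OFFSET_SECONDS):
    """Seconds-of-minute of the local wall-clock time for a UTC second count."""
    return (utc_seconds_of_day + offset) % 60

# A local wall-clock reading of 12:14:50 corresponds to this UTC value:
utc = (12 * 3600 + 14 * 60 + 50) - OFFSET_SECONDS

print(seconds_field(utc))  # 50 -- offset applied, matches the wall clock
print(utc % 60)            # 7  -- offset ignored, the buggy behaviour
```

With a whole-minute offset the two computations coincide, which is why the bug only surfaces for old timestamps in zones with odd historical offsets.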
[jira] [Updated] (SPARK-11759) Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: not found
[ https://issues.apache.org/jira/browse/SPARK-11759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11759: -- Component/s: Mesos Deploy > Spark task on mesos with docker fails with sh: 1: /opt/spark/bin/spark-class: > not found > --- > > Key: SPARK-11759 > URL: https://issues.apache.org/jira/browse/SPARK-11759 > Project: Spark > Issue Type: Question > Components: Deploy, Mesos >Reporter: Luis Alves > > I'm using Spark 1.5.1 and Mesos 0.25 in cluster mode. I have the > spark-dispatcher running, and run spark-submit. The driver is launched, but > it fails because it seems that the task it launches fails. > In the logs of the launched task I can see the following error: > sh: 1: /opt/spark/bin/spark-class: not found > I checked my docker image and /opt/spark/bin/spark-class exists. I then > noticed that it's using sh, so I tried to run (in the docker image) > the following: > sh /opt/spark/bin/spark-class org.apache.spark.deploy.master.Master > It fails with the following error: > spark-class: 73: spark-class: Syntax error: "(" unexpected > Is this an error in Spark? > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
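A likely explanation of the second error (an assumption, not confirmed in the thread): on Debian, `/bin/sh` is dash, a strict POSIX shell, while `spark-class` uses bash-only constructs such as arrays, so running it via `sh` fails with `Syntax error: "(" unexpected`. The difference can be demonstrated from Python (assumes `bash` is installed; the snippet is illustrative, not taken from spark-class):

```python
import subprocess

# Bash-only array syntax, similar in spirit to what spark-class uses.
snippet = 'CMD=(echo hello from an array); "${CMD[@]}"'

# bash accepts the array syntax and runs the command it builds.
bash = subprocess.run(["bash", "-c", snippet], capture_output=True, text=True)
print(bash.stdout.strip())  # hello from an array

# Under a strict POSIX sh such as dash, the same snippet is rejected at parse
# time with: Syntax error: "(" unexpected
```

If that is the cause, invoking the script with bash explicitly (or making bash the image's `sh`) would sidestep the error.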
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006966#comment-15006966 ] Shivaram Venkataraman commented on SPARK-11281: --- Does the example posted in the description work now, or does it still not work? Sorry, I'm just confused about what the resolution to this bug was (i.e. whether it was fixed or we decided we won't fix it, etc.) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006955#comment-15006955 ] Shivaram Venkataraman commented on SPARK-11281: --- [~sunrui] [~zero323] Is there a test case in https://github.com/apache/spark/commit/d7d9fa0b8750166f8b74f9bc321df26908683a8b that covers this? > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11281: -- Assignee: Maciej Szymkiewicz > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006960#comment-15006960 ] Maciej Szymkiewicz commented on SPARK-11281: [~shivaram] No, there isn't. I removed this one because there was nothing we could test there. > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman reopened SPARK-11281: --- Assignee: (was: Maciej Szymkiewicz) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006979#comment-15006979 ] Apache Spark commented on SPARK-11716: -- User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9739 > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
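The fix the reporter proposes — passing the input type information through to the returned UserDefinedFunction instead of dropping it — can be sketched with a hypothetical pure-Python registry (the names and signatures below are illustrative; Spark's actual API is Scala):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UserDefinedFunction:
    """Hypothetical stand-in for Spark's UserDefinedFunction."""
    func: Callable
    data_type: type
    input_types: List[type]  # the information the report says is dropped


class UDFRegistration:
    """Sketch of a registry whose register() keeps input types."""

    def __init__(self):
        self._udfs = {}

    def register(self, name, func, data_type, input_types):
        # Proposed change: construct the UDF with input_types instead of
        # discarding them, i.e. UserDefinedFunction(func, dataType, inputType)
        udf = UserDefinedFunction(func, data_type, input_types)
        self._udfs[name] = udf
        return udf


reg = UDFRegistration()
plus = reg.register("plus", lambda a, b: a + b, int, [int, int])
```

With this one-line change per register overload, callers can inspect `plus.input_types` after registration rather than getting an empty list.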
[jira] [Comment Edited] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006998#comment-15006998 ] Marcelo Vanzin edited comment on SPARK-11617 at 11/16/15 5:53 PM: -- Can you post the exceptions if they're different than the ones you posted before? was (Author: vanzin): Can you post the exception if they're different than the ones you posted before? > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10181) HiveContext is not used with keytab principal but with user principal/unix username
[ https://issues.apache.org/jira/browse/SPARK-10181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-10181: - Fix Version/s: 1.5.3 > HiveContext is not used with keytab principal but with user principal/unix > username > --- > > Key: SPARK-10181 > URL: https://issues.apache.org/jira/browse/SPARK-10181 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: kerberos >Reporter: Bolke de Bruin >Assignee: Yu Gao > Labels: hive, hivecontext, kerberos > Fix For: 1.5.3, 1.6.0 > > > `bin/spark-submit --num-executors 1 --executor-cores 5 --executor-memory 5G > --driver-java-options -XX:MaxPermSize=4G --driver-class-path > lib/datanucleus-api-jdo-3.2.6.jar:lib/datanucleus-core-3.2.10.jar:lib/datanucleus-rdbms-3.2.9.jar:conf/hive-site.xml > --files conf/hive-site.xml --master yarn --principal sparkjob --keytab > /etc/security/keytabs/sparkjob.keytab --conf > spark.yarn.executor.memoryOverhead=18000 --conf > "spark.executor.extraJavaOptions=-XX:MaxPermSize=4G" --conf > spark.eventLog.enabled=false ~/test.py` > With: > #!/usr/bin/python > from pyspark import SparkContext > from pyspark.sql import HiveContext > sc = SparkContext() > sqlContext = HiveContext(sc) > query = """ SELECT * FROM fm.sk_cluster """ > rdd = sqlContext.sql(query) > rdd.registerTempTable("test") > sqlContext.sql("CREATE TABLE wcs.test LOCATION '/tmp/test_gl' AS SELECT * > FROM test") > Ends up with: > Caused by: > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): > Permission denie > d: user=ua80tl, access=READ_EXECUTE, > inode="/tmp/test_gl/.hive-staging_hive_2015-08-24_10-43-09_157_78057390024057878 > 34-1/-ext-1":sparkjob:hdfs:drwxr-x--- > (Our umask denies read access to other by default) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11512) Bucket Join
[ https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006975#comment-15006975 ] Alex Nastetsky commented on SPARK-11512: There are 3 situations: 1) dataset A and dataset B are both partitioned/sorted the same on disk and need to be joined. The join should be able to take advantage of their partitioning/sort. 2) dataset A is partitioned/sorted on disk, dataset B gets generated during the app run and needs to be joined to dataset A. The join should be able to take advantage of dataset A's partitioning/sort and mimic the same partitioning/sort on dataset B, without having to pre-process dataset A. Perhaps something like repartitionAndSortWithinPartitions could be performed on dataset B? 3) dataset A and B are both generated during the app run and need to be joined. I believe doing a Sort Merge Join on these is already supported in SPARK-2213. The first 2 situations are what this ticket is for. > Bucket Join > --- > > Key: SPARK-11512 > URL: https://issues.apache.org/jira/browse/SPARK-11512 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Cheng Hao > > Sort merge join on two datasets on the file system that have already been > partitioned the same with the same number of partitions and sorted within > each partition, so we don't need to sort them again while joining on the > sorted/partitioned keys > This functionality exists in > - Hive (hive.optimize.bucketmapjoin.sortedmerge) > - Pig (USING 'merge') > - MapReduce (CompositeInputFormat) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
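Situation 1 above — joining two datasets that are already partitioned and sorted the same way — reduces to a streaming merge with no shuffle or re-sort. A minimal pure-Python sketch of the idea (illustrative only; Spark's sort-merge join from SPARK-2213 is far more involved):

```python
def sort_merge_join(left, right):
    """Inner join of two lists of (key, value) pairs, both assumed
    pre-sorted by key. Advances two cursors in lockstep, so neither
    side is re-sorted or shuffled."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the left row against every right row sharing the key
            jj = j
            while jj < len(right) and right[jj][0] == lk:
                out.append((lk, lv, right[jj][1]))
                jj += 1
            i += 1
    return out


joined = sort_merge_join([(1, "a"), (2, "b"), (2, "c")], [(2, "x"), (3, "y")])
```

Skipping the sort on both sides is exactly the saving this ticket asks for when the on-disk bucketing already guarantees the ordering.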
[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes
[ https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006978#comment-15006978 ] shane knapp commented on SPARK-11655: - just wanted to say that things are definitely looking a LOT better! i'll keep an eye on things this week, but we're definitely out of the woods. thanks [~joshrosen] and [~vanzin]! > SparkLauncherBackendSuite leaks child processes > --- > > Key: SPARK-11655 > URL: https://issues.apache.org/jira/browse/SPARK-11655 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 1.6.0 >Reporter: Josh Rosen >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.6.0 > > Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png > > > We've been combatting an orphaned process issue on AMPLab Jenkins since > October and I finally was able to dig in and figure out what's going on. > After some sleuthing and working around OS limits and JDK bugs, I was able to > get the full launch commands for the hanging orphaned processes. 
It looks > like they're all running spark-submit: > {code} > org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf > spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/ > -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m > {code} > Based on the output of some Ganglia graphs, I was able to figure out that > these leaks started around October 9. > !screenshot-1.png|thumbnail! > This roughly lines up with when https://github.com/apache/spark/pull/7052 was > merged, which added LauncherBackendSuite. The launch arguments used in this > suite seem to line up with the arguments that I observe in the hanging > processes' {{jps}} output: > https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46 > Interestingly, Jenkins doesn't show test timing or output for this suite! 
I > think that what might be happening is that we have a mixed Scala/Java > package, so maybe the two test runner XML files aren't being merged properly: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/ > Whenever I try running this suite locally, it looks like it ends up creating > a zombie SparkSubmit process! I think that what's happening is that the > launcher's {{handle.kill()}} call ends up destroying the bash > {{spark-submit}} subprocess such that its child process (a JVM) leaks. > I think that we'll have to do something similar to what we do in PySpark when > launching a child JVM from a Python / Bash process: connect it to a socket or > stream such that it can detect its parent's death and clean up after itself > appropriately. > /cc [~shaneknapp] and [~vanzin]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
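The PySpark-style fix described above — letting the child JVM detect its parent's death through a connected stream and clean up after itself — can be sketched as a standalone demo (hypothetical code, not Spark's actual launcher implementation):

```python
import subprocess
import sys
import textwrap

# The child blocks reading stdin; read() returns EOF once the parent's end
# of the pipe closes (because the parent exited or cleaned up), at which
# point the child exits instead of lingering as an orphaned process.
CHILD_SCRIPT = textwrap.dedent("""
    import sys
    sys.stdin.read()   # blocks until the parent closes the pipe
    sys.exit(0)        # clean shutdown on parent death
""")


def launch_monitored_child():
    """Start a child whose lifetime is tied to the parent via a pipe."""
    return subprocess.Popen([sys.executable, "-c", CHILD_SCRIPT],
                            stdin=subprocess.PIPE)


proc = launch_monitored_child()
proc.stdin.close()      # simulate the parent going away
proc.wait(timeout=10)   # the child notices EOF and exits promptly
```

The same pattern works whether the channel is a pipe or a socket; the key point is that the child owns one end of a connection that the OS closes for it when the parent dies.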
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007064#comment-15007064 ] Shivaram Venkataraman commented on SPARK-11281: --- That's cool! Let's keep this open till we add tests and then close it as part of that PR > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to access a Map field created from an environment. > Assuming a local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > The problem seems to be specific to environments and cannot be reproduced when > the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls
[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007131#comment-15007131 ] Joseph K. Bradley commented on SPARK-11569: --- To choose the right API, my first comments are: * What do other libraries do when given null/bad values? (scikit-learn and R are the ones I tend to look at.) * I'd prefer to make the behavior adjustable using an option with a default. The default I'd vote for is throwing a nice error upon seeing null, though I could be convinced to go for another. * When we do index null, we should ideally maintain current indexing behavior, so it may make the most sense to put null at the end. > StringIndexer transform fails when column contains nulls > > > Key: SPARK-11569 > URL: https://issues.apache.org/jira/browse/SPARK-11569 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.4.0, 1.5.0, 1.6.0 >Reporter: Maciej Szymkiewicz > > Transforming column containing {{null}} values using {{StringIndexer}} > results in {{java.lang.NullPointerException}} > {code} > from pyspark.ml.feature import StringIndexer > df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v")) > df.printSchema() > ## root > ## |-- k: string (nullable = true) > ## |-- v: long (nullable = true) > indexer = StringIndexer(inputCol="k", outputCol="kIdx") > indexer.fit(df).transform(df) > ##py4j.protocol.Py4JJavaError: An error occurred while calling o75.json. 
> ## : java.lang.NullPointerException > {code} > Problem disappears when we drop > {code} > df1 = df.na.drop() > indexer.fit(df1).transform(df1) > {code} > or replace {{nulls}} > {code} > from pyspark.sql.functions import col, when > k = col("k") > df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k)) > indexer.fit(df2).transform(df2) > {code} > and cannot be reproduced using Scala API > {code} > import org.apache.spark.ml.feature.StringIndexer > val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v") > df.printSchema > // root > // |-- k: string (nullable = true) > // |-- v: integer (nullable = false) > val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx") > indexer.fit(df).transform(df).count > // 2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
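One way to realize the option discussed above — keep the current frequency-ordered label indexing and, when nulls are indexed at all, assign null the last index — is sketched below in plain Python (illustrative only, not Spark ML's StringIndexer implementation):

```python
from collections import Counter


def fit_string_indexer(values, index_null=True):
    """Build a label -> index map ordered by descending frequency
    (ties broken alphabetically, mirroring StringIndexer's ordering).
    If index_null is True and nulls are present, null gets the last
    index so existing label indices are unchanged."""
    freq = Counter(v for v in values if v is not None)
    labels = [lab for lab, _ in sorted(freq.items(),
                                       key=lambda kv: (-kv[1], kv[0]))]
    index = {lab: i for i, lab in enumerate(labels)}
    if index_null and any(v is None for v in values):
        index[None] = len(labels)  # null appended after all real labels
    return index


idx = fit_string_indexer(["a", "b", "a", None])
```

With `index_null=False` the transform could instead raise a descriptive error on the first null, which matches the proposed default of failing loudly rather than with a bare NullPointerException.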
[jira] [Updated] (SPARK-11044) Parquet writer version fixed as version1
[ https://issues.apache.org/jira/browse/SPARK-11044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-11044: --- Fix Version/s: 1.6.0 > Parquet writer version fixed as version1 > > > Key: SPARK-11044 > URL: https://issues.apache.org/jira/browse/SPARK-11044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 1.6.0, 1.7.0 > > > Spark only writes the parquet files with writer version1 ignoring given > configuration. > It should let users choose the writer version. (remaining the default as > version1). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11743) Add UserDefinedType support to RowEncoder
[ https://issues.apache.org/jira/browse/SPARK-11743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11743: -- Assignee: Liang-Chi Hsieh > Add UserDefinedType support to RowEncoder > - > > Key: SPARK-11743 > URL: https://issues.apache.org/jira/browse/SPARK-11743 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > RowEncoder doesn't support UserDefinedType now. We should add the support for > it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11752) fix timezone problem for DateTimeUtils.getSeconds
[ https://issues.apache.org/jira/browse/SPARK-11752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11752: -- Assignee: Wenchen Fan > fix timezone problem for DateTimeUtils.getSeconds > - > > Key: SPARK-11752 > URL: https://issues.apache.org/jira/browse/SPARK-11752 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.5.3, 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11716: Assignee: (was: Apache Spark) > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11716) UDFRegistration Drops Input Type Information
[ https://issues.apache.org/jira/browse/SPARK-11716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11716: Assignee: Apache Spark > UDFRegistration Drops Input Type Information > > > Key: SPARK-11716 > URL: https://issues.apache.org/jira/browse/SPARK-11716 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Artjom Metro >Assignee: Apache Spark >Priority: Minor > Labels: sql, udf > > The UserDefinedFunction returned by the UDFRegistration does not contain the > input type information, although that information is available. > To fix the issue the last line of every register function would had to be > changed to "UserDefinedFunction(func, dataType, inputType)" or is there any > specific reason this was not done? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-11281: --- Comment: was deleted (was: [~sunrui], [~shivaram] I don't think it is resolved by [SPARK-11086]. ) > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to to access Map field created from an environment. > Assuming local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > Problem seems to be specific to environments and cannot be reproduced when > Map comes for example from Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15006998#comment-15006998 ] Marcelo Vanzin commented on SPARK-11617: Can you post the exception if they're different than the ones you posted before? > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11713) Initial RDD for updateStateByKey for pyspark
[ https://issues.apache.org/jira/browse/SPARK-11713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007004#comment-15007004 ] Bryan Cutler commented on SPARK-11713: -- I could work on this > Initial RDD for updateStateByKey for pyspark > > > Key: SPARK-11713 > URL: https://issues.apache.org/jira/browse/SPARK-11713 > Project: Spark > Issue Type: New Feature > Components: PySpark >Reporter: David Watson > > It would be infinitely useful to add initial rdd to the pyspark DStream > interface to match the scala and java interfaces > (https://issues.apache.org/jira/browse/SPARK-3660). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11553: - Target Version/s: 1.6.0 Priority: Blocker (was: Minor) > row.getInt(i) if row[i]=null returns 0 > -- > > Key: SPARK-11553 > URL: https://issues.apache.org/jira/browse/SPARK-11553 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Tofigh >Priority: Blocker > > row.getInt|getFloat|getDouble on a Spark Row return 0 if row[index] is null, even > though, according to the documentation, they should throw a NullPointerException. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
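The documented contract — failing loudly on null instead of silently coercing it to a primitive default — can be sketched with a hypothetical row wrapper (not Spark's actual Row class):

```python
class Row:
    """Sketch of a row accessor that refuses to coerce null to 0."""

    def __init__(self, values):
        self._values = list(values)

    def is_null_at(self, i):
        """Callers can check for null explicitly before a typed getter."""
        return self._values[i] is None

    def get_int(self, i):
        v = self._values[i]
        if v is None:
            # Match the documented behavior: raise rather than return 0,
            # so a genuine 0 is distinguishable from a missing value
            raise ValueError(f"value at index {i} is null; "
                             f"check is_null_at({i}) first")
        return int(v)


row = Row([42, None])
```

Returning 0 for null is dangerous precisely because the caller cannot tell a real zero apart from a missing value; the explicit `is_null_at` check makes the intent visible at the call site.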
[jira] [Resolved] (SPARK-11718) Explicit killing executor dies silent without get response information
[ https://issues.apache.org/jira/browse/SPARK-11718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11718. Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 1.6.0 > Explicit killing executor dies silent without get response information > -- > > Key: SPARK-11718 > URL: https://issues.apache.org/jira/browse/SPARK-11718 > Project: Spark > Issue Type: Bug > Components: Scheduler, YARN >Affects Versions: 1.6.0 >Reporter: Saisai Shao >Assignee: Saisai Shao > Fix For: 1.6.0 > > > Because of change of AM and scheduler executor failure detection mechanism, > explicit killing executor can not response back to driver, this will make > dynamic allocation wrongly maintain the executor metadata. > I'm working on this... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11684) Update user guide to show new features in SparkR:::glm and SparkR:::summary
[ https://issues.apache.org/jira/browse/SPARK-11684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11684: -- Shepherd: Xiangrui Meng > Update user guide to show new features in SparkR:::glm and SparkR:::summary > --- > > Key: SPARK-11684 > URL: https://issues.apache.org/jira/browse/SPARK-11684 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > * feature interaction in R formula > * model coefficients in logistic regression > * model summary in linear regression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007482#comment-15007482 ] Apache Spark commented on SPARK-11439: -- User 'nakul02' has created a pull request for this issue: https://github.com/apache/spark/pull/9745 > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, generating sparse features in {{LinearDataGenerator}} requires > creating dense vectors first. It would be more cost-efficient to avoid > generating dense vectors when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
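The optimization this ticket proposes — emitting sparse (index, value) pairs directly rather than materializing a dense vector and filtering it afterwards — could look like this (a hypothetical sketch, not the actual LinearDataGenerator change):

```python
import random


def generate_sparse_features(n_features, density, seed=0):
    """Hypothetical sketch: sample only the active indices and draw
    values for those, so no O(n_features) dense vector is allocated."""
    rng = random.Random(seed)
    n_active = max(1, int(n_features * density))
    # Choose which positions are nonzero, then fill just those positions
    indices = sorted(rng.sample(range(n_features), n_active))
    return [(i, rng.gauss(0.0, 1.0)) for i in indices]


feats = generate_sparse_features(100, 0.1)
```

For low densities this does O(density * n_features) work instead of O(n_features), which is the saving the ticket is after.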
[jira] [Assigned] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11439: Assignee: Apache Spark > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Assignee: Apache Spark >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
Marcelo Vanzin created SPARK-11762: -- Summary: TransportResponseHandler should consider open streams when counting outstanding requests Key: SPARK-11762 URL: https://issues.apache.org/jira/browse/SPARK-11762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.6.0 Reporter: Marcelo Vanzin Priority: Minor This code in TransportResponseHandler: {code} public int numOutstandingRequests() { return outstandingFetches.size() + outstandingRpcs.size(); } {code} is used to determine whether the channel is currently in use; if there's a timeout and the channel is in use, then the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
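The fix the ticket implies is to count open streams alongside block fetches and RPCs. A minimal Python sketch of the idea (the real class is Java, in Spark's network-common module; names below mirror it loosely):

```python
class TransportResponseHandler:
    """Sketch: the idle-timeout check asks this handler whether the
    channel is busy, so open streams must count as activity too."""

    def __init__(self):
        self.outstanding_fetches = {}   # stream/chunk fetch callbacks
        self.outstanding_rpcs = {}      # rpc id -> callback
        self.num_active_streams = 0     # incremented while a stream is open

    def num_outstanding_requests(self):
        # Including active streams means a timeout during a stream
        # transfer is treated like any other in-use channel, so the
        # channel gets closed instead of lingering open
        return (len(self.outstanding_fetches)
                + len(self.outstanding_rpcs)
                + self.num_active_streams)


handler = TransportResponseHandler()
handler.outstanding_rpcs[1] = "callback"
handler.num_active_streams += 1
```

The stream counter would be incremented when a stream request is sent and decremented when the stream completes or fails, keeping the count symmetric.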
[jira] [Reopened] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-11016: Assignee: (was: Liang-Chi Hsieh) https://github.com/apache/spark/pull/9243 is reverted > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > Fix For: 1.6.0 > > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007565#comment-15007565 ] Davies Liu commented on SPARK-11016: [~charles.al...@acxiom.com] Could you send your patch to github.com/apache/spark ? > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > Fix For: 1.6.0 > > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.<init>(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.<clinit>(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at 
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11756) SparkR can not output help information for SparkR:::summary correctly
[ https://issues.apache.org/jira/browse/SPARK-11756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007481#comment-15007481 ] Felix Cheung commented on SPARK-11756: -- [~yanboliang] Could you please clarify what the issue is? That it shows 'Summaries {base}', or that it says 'describe {SparkR}'? > SparkR can not output help information for SparkR:::summary correctly > - > > Key: SPARK-11756 > URL: https://issues.apache.org/jira/browse/SPARK-11756 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Reporter: Yanbo Liang > > R users often get help information for a method like this: > {code} > > ?summary > {code} > or > {code} > > help(summary) > {code} > For SparkR we should provide the help information for both the SparkR package and > the base R package (usually the stats package). > But for the "summary" method, the help information is not shown correctly. > {code} > > help(summary) > Help on topic ‘summary’ was found in the following packages: > Package Library > SparkR /Users/yanboliang/data/trunk2/spark/R/lib > base /Library/Frameworks/R.framework/Resources/library > Choose one > 1: describe {SparkR} > 2: Object Summaries {base} > {code} > It only shows the help for describe(DataFrame), which is synonymous with > summary(DataFrame); we also need the help information for > summary(PipelineModel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11390) Query plan with/without filterPushdown indistinguishable
[ https://issues.apache.org/jira/browse/SPARK-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11390. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9679 [https://github.com/apache/spark/pull/9679] > Query plan with/without filterPushdown indistinguishable > > > Key: SPARK-11390 > URL: https://issues.apache.org/jira/browse/SPARK-11390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: All >Reporter: Vishesh Garg >Priority: Minor > Fix For: 1.6.0 > > > The execution plan of a query remains the same regardless of whether the > filterPushdown flag has been set to "true" or "false", as can be seen below: > == > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > == > Ideally, when the filterPushdown flag is set to "true", both the scan and the > filter nodes should be merged together to make it clear that the filtering is > being done by the data source itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11756) SparkR can not output help information for SparkR:::summary correctly
[ https://issues.apache.org/jira/browse/SPARK-11756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-11756: - Component/s: SparkR > SparkR can not output help information for SparkR:::summary correctly > - > > Key: SPARK-11756 > URL: https://issues.apache.org/jira/browse/SPARK-11756 > Project: Spark > Issue Type: Bug > Components: R, SparkR >Reporter: Yanbo Liang > > R users often get help information for a method like this: > {code} > > ?summary > {code} > or > {code} > > help(summary) > {code} > For SparkR we should provide the help information for both the SparkR package and > the base R package (usually the stats package). > But for the "summary" method, the help information is not shown correctly. > {code} > > help(summary) > Help on topic ‘summary’ was found in the following packages: > Package Library > SparkR /Users/yanboliang/data/trunk2/spark/R/lib > base /Library/Frameworks/R.framework/Resources/library > Choose one > 1: describe {SparkR} > 2: Object Summaries {base} > {code} > It only shows the help for describe(DataFrame), which is synonymous with > summary(DataFrame); we also need the help information for > summary(PipelineModel). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007487#comment-15007487 ] Nakul Jindal commented on SPARK-11439: -- Thanks [~lewuathe]. I've also updated the comment in the LinearRegressionSuite.scala file with an R snippet to reproduce the results. > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11747) Can not specify input path in python logistic_regression example under ml
[ https://issues.apache.org/jira/browse/SPARK-11747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007528#comment-15007528 ] Joseph K. Bradley commented on SPARK-11747: --- Although there are some examples which are essentially command-line scripts, most examples are really meant to be copied and modified as needed. We may need to wait on this, depending on how testable example code refactoring happens: [SPARK-11337] > Can not specify input path in python logistic_regression example under ml > - > > Key: SPARK-11747 > URL: https://issues.apache.org/jira/browse/SPARK-11747 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Jeff Zhang >Priority: Minor > > Not sure why it is hard-coded; it would be nice to allow the user to specify the > input path > {code} > # Load and parse the data file into a dataframe. > df = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
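As a sketch of the requested improvement, the example could accept the path on the command line and fall back to the current file as a default. The argument name and wiring below are illustrative, not the actual example's interface:

```python
# Sketch: read the input path from the command line instead of hard-coding it.
# The positional argument is optional; omitting it keeps today's behavior.
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="logistic_regression example")
    parser.add_argument("input", nargs="?",
                        default="data/mllib/sample_libsvm_data.txt",
                        help="path to a LibSVM-format input file")
    return parser.parse_args(argv)

args = parse_args(["/tmp/my_data.txt"])
print(args.input)  # /tmp/my_data.txt

# The example's load line would then become (hypothetical):
# df = MLUtils.loadLibSVMFile(sc, args.input).toDF()
```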
[jira] [Updated] (SPARK-11390) Query plan with/without filterPushdown indistinguishable
[ https://issues.apache.org/jira/browse/SPARK-11390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11390: - Assignee: Zee Chen > Query plan with/without filterPushdown indistinguishable > > > Key: SPARK-11390 > URL: https://issues.apache.org/jira/browse/SPARK-11390 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: All >Reporter: Vishesh Garg >Assignee: Zee Chen >Priority: Minor > Fix For: 1.6.0 > > > The execution plan of a query remains the same regardless of whether the > filterPushdown flag has been set to "true" or "false", as can be seen below: > == > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "false") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > scala> sqlContext.setConf("spark.sql.orc.filterPushdown", "true") > scala> sqlContext.sql("SELECT name FROM people WHERE age = 15").explain() > == Physical Plan == > Project [name#6] > Filter (age#7 = 15) > Scan OrcRelation[hdfs://localhost:9000/user/spec/people][name#6,age#7] > == > Ideally, when the filterPushdown flag is set to "true", both the scan and the > filter nodes should be merged together to make it clear that the filtering is > being done by the data source itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11439) Optimization of creating sparse feature without dense one
[ https://issues.apache.org/jira/browse/SPARK-11439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11439: Assignee: (was: Apache Spark) > Optimization of creating sparse feature without dense one > - > > Key: SPARK-11439 > URL: https://issues.apache.org/jira/browse/SPARK-11439 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Kai Sasaki >Priority: Minor > > Currently, sparse feature generated in {{LinearDataGenerator}} needs to > create dense vectors once. It is cost efficient to prevent from generating > dense feature when creating sparse features. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11271) MapStatus too large for driver
[ https://issues.apache.org/jira/browse/SPARK-11271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-11271: https://github.com/apache/spark/pull/9243 is reverted > MapStatus too large for driver > -- > > Key: SPARK-11271 > URL: https://issues.apache.org/jira/browse/SPARK-11271 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Reporter: Kent Yao >Assignee: Liang-Chi Hsieh > Fix For: 1.6.0 > > > When I run a spark job that contains quite a lot of tasks (in my case > 200k[maptasks]*200k[reducetasks]), the driver hit an OOM mainly caused by > the MapStatus objects; > the RoaringBitmap used to mark which blocks are empty seems to use too much > memory. > I tried org.apache.spark.util.collection.BitSet instead of > RoaringBitmap, and it saves about 20% of the memory. > For the 200K-task job, > RoaringBitmap uses 3 Long[1024] and 1 Short[3392] > = 3*64*1024 + 16*3392 = 250880 (bits) > BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits) > Memory saved = (250880 - 200000) / 250880 ≈ 20% -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
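The savings estimate in the report can be checked with a quick back-of-the-envelope script. This is only arithmetic over the array sizes quoted above; per-object JVM overhead is ignored:

```python
# Back-of-the-envelope memory estimate for tracking 200K blocks per MapStatus,
# using the array sizes quoted in the report.
LONG_BITS = 64
SHORT_BITS = 16

# RoaringBitmap: 3 Long[1024] word arrays plus 1 Short[3392] of keys/cardinalities.
roaring_bits = 3 * 1024 * LONG_BITS + 3392 * SHORT_BITS   # 250880

# Plain BitSet: one bit per block -> ceil(200_000 / 64) = 3125 longs.
bitset_bits = 3125 * LONG_BITS                            # 200000

savings = (roaring_bits - bitset_bits) / roaring_bits
print(roaring_bits, bitset_bits, round(savings * 100))    # 250880 200000 20
```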
[jira] [Resolved] (SPARK-6328) Python API for StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-6328. -- Resolution: Fixed Fix Version/s: 1.6.0 > Python API for StreamingListener > > > Key: SPARK-6328 > URL: https://issues.apache.org/jira/browse/SPARK-6328 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Yifan Wang > Fix For: 1.6.0 > > > StreamingListener API is only available in Java/Scala. It will be useful to > make it available in Python so that Spark application written in python can > check the status of ongoing streaming computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11720: -- Component/s: ML > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Jihong MA >Priority: Minor > > change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it inline with all other univariate stats function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11720) Return Double.NaN instead of null for Mean and Average when count = 0
[ https://issues.apache.org/jira/browse/SPARK-11720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11720: -- Assignee: Jihong MA > Return Double.NaN instead of null for Mean and Average when count = 0 > - > > Key: SPARK-11720 > URL: https://issues.apache.org/jira/browse/SPARK-11720 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA >Assignee: Jihong MA >Priority: Minor > > change the default behavior of mean in case of count = 0 from null to > Double.NaN, to make it inline with all other univariate stats function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11761: Assignee: Apache Spark > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu >Assignee: Apache Spark > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11761: Assignee: (was: Apache Spark) > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
[ https://issues.apache.org/jira/browse/SPARK-11761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007215#comment-15007215 ] Apache Spark commented on SPARK-11761: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/9741 > Prevent the call to StreamingContext#stop() in the listener bus's thread > > > Key: SPARK-11761 > URL: https://issues.apache.org/jira/browse/SPARK-11761 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Ted Yu > > Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : > {code} > The user should not call stop or other long-time work in a listener since it > will block the listener thread, and prevent from stopping > SparkContext/StreamingContext. > I cannot see an approach since we need to stop the listener bus's thread > before stopping SparkContext/StreamingContext totally. > {code} > Proposed solution is to prevent the call to StreamingContext#stop() in the > listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11754) consolidate `ExpressionEncoder.tuple` and `Encoders.tuple`
[ https://issues.apache.org/jira/browse/SPARK-11754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11754. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9729 [https://github.com/apache/spark/pull/9729] > consolidate `ExpressionEncoder.tuple` and `Encoders.tuple` > -- > > Key: SPARK-11754 > URL: https://issues.apache.org/jira/browse/SPARK-11754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11754) consolidate `ExpressionEncoder.tuple` and `Encoders.tuple`
[ https://issues.apache.org/jira/browse/SPARK-11754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11754: - Assignee: Wenchen Fan > consolidate `ExpressionEncoder.tuple` and `Encoders.tuple` > -- > > Key: SPARK-11754 > URL: https://issues.apache.org/jira/browse/SPARK-11754 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Target Version/s: 1.6.0 > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11732) MiMa excludes miss private classes
[ https://issues.apache.org/jira/browse/SPARK-11732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-11732: -- Assignee: Tim Hunter > MiMa excludes miss private classes > -- > > Key: SPARK-11732 > URL: https://issues.apache.org/jira/browse/SPARK-11732 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > The checks in GenerateMIMAIgnore only check for package private classes, not > private classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11731) Enable batching on Driver WriteAheadLog by default
[ https://issues.apache.org/jira/browse/SPARK-11731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-11731. --- Resolution: Fixed Assignee: Burak Yavuz Fix Version/s: 1.6.0 > Enable batching on Driver WriteAheadLog by default > -- > > Key: SPARK-11731 > URL: https://issues.apache.org/jira/browse/SPARK-11731 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.0 > > > Using batching on the driver for the WriteAheadLog should be an improvement > for all environments and use cases. Users will be able to scale to much > higher number of receivers with the BatchedWriteAheadLog. Therefore we should > turn it on by default, and QA it in the QA period. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
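The batching idea behind the driver's WriteAheadLog can be sketched without any Spark code: many receivers enqueue records, and a single writer drains whatever has accumulated into one log write, cutting the number of slow fsync-style operations. This is a simplified stand-in for the BatchedWriteAheadLog, not its actual implementation:

```python
# Sketch: batch queued records so many producer writes become one log write.
import queue

def drain_batch(q):
    """Block for one record, then grab everything else already queued."""
    batch = [q.get()]
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            return batch

q = queue.Queue()
for record in (b"a", b"b", b"c"):   # three receivers enqueue records
    q.put(record)
print(drain_batch(q))               # one batch: [b'a', b'b', b'c']
```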
[jira] [Created] (SPARK-11761) Prevent the call to StreamingContext#stop() in the listener bus's thread
Ted Yu created SPARK-11761: -- Summary: Prevent the call to StreamingContext#stop() in the listener bus's thread Key: SPARK-11761 URL: https://issues.apache.org/jira/browse/SPARK-11761 Project: Spark Issue Type: Bug Components: Streaming Reporter: Ted Yu Quoting Shixiong's comment from https://github.com/apache/spark/pull/9723 : {code} The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext. I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally. {code} Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
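The deadlock this issue guards against can be shown with plain Python threading (a stand-in for the listener bus, not Spark code): stop() joins the dispatch thread, so a stop() issued *from* that thread would wait on itself forever.

```python
# Minimal illustration of why stop() must not run on the listener thread:
# stop() joins the dispatch thread, which can never finish while it is
# itself blocked inside stop().
import threading

class ListenerBus:
    def __init__(self):
        self.thread = threading.Thread(target=self._run, daemon=True)
        self._stopped = threading.Event()

    def _run(self):
        self._stopped.wait()  # stand-in for the event-dispatch loop

    def start(self):
        self.thread.start()

    def stop(self):
        # The proposed guard: refuse to stop from the bus's own thread.
        if threading.current_thread() is self.thread:
            raise RuntimeError("cannot call stop() from the listener thread")
        self._stopped.set()
        self.thread.join()  # would never return if called from self.thread

bus = ListenerBus()
bus.start()
bus.stop()
print("stopped cleanly")
```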
[jira] [Commented] (SPARK-11319) PySpark silently Accepts null values in non-nullable DataFrame fields.
[ https://issues.apache.org/jira/browse/SPARK-11319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007220#comment-15007220 ] Daniel Jalova commented on SPARK-11319: --- Seems that this is possible in the Scala API too. > PySpark silently Accepts null values in non-nullable DataFrame fields. > -- > > Key: SPARK-11319 > URL: https://issues.apache.org/jira/browse/SPARK-11319 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Kevin Cox > > Running the following code with a null value in a non-nullable column > silently works. This makes the code incredibly hard to trust. > {code} > In [2]: from pyspark.sql.types import * > In [3]: sqlContext.createDataFrame([(None,)], StructType([StructField("a", > TimestampType(), False)])).collect() > Out[3]: [Row(a=None)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
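A plain-Python sketch of the validation the reporter expected createDataFrame to perform. The function and field representation here are illustrative, not the PySpark API:

```python
# Sketch: reject None in a field declared non-nullable, instead of silently
# accepting it as the reported behavior does.
def check_nullability(rows, fields):
    """fields: list of (name, nullable) pairs; rows: list of tuples."""
    for i, row in enumerate(rows):
        for (name, nullable), value in zip(fields, row):
            if value is None and not nullable:
                raise ValueError(
                    f"row {i}: field '{name}' is non-nullable but got None")

check_nullability([(1,), (2,)], [("a", False)])      # passes silently
try:
    check_nullability([(None,)], [("a", False)])     # the reported case
except ValueError as e:
    print(e)  # row 0: field 'a' is non-nullable but got None
```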
[jira] [Updated] (SPARK-6328) Python API for StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-6328: - Assignee: Yifan Wang > Python API for StreamingListener > > > Key: SPARK-6328 > URL: https://issues.apache.org/jira/browse/SPARK-6328 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Yifan Wang >Assignee: Yifan Wang > Fix For: 1.6.0 > > > StreamingListener API is only available in Java/Scala. It will be useful to > make it available in Python so that Spark application written in python can > check the status of ongoing streaming computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11281) Issue with creating and collecting DataFrame using environments
[ https://issues.apache.org/jira/browse/SPARK-11281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007272#comment-15007272 ] Apache Spark commented on SPARK-11281: -- User 'zero323' has created a pull request for this issue: https://github.com/apache/spark/pull/9743 > Issue with creating and collecting DataFrame using environments > > > Key: SPARK-11281 > URL: https://issues.apache.org/jira/browse/SPARK-11281 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 > Environment: R 3.2.2, Spark build from master > 487d409e71767c76399217a07af8de1bb0da7aa8 >Reporter: Maciej Szymkiewicz > Fix For: 1.6.0 > > > It is not possible to to access Map field created from an environment. > Assuming local data frame is created as follows: > {code} > ldf <- data.frame(row.names=1:2) > ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) > str(ldf) > ## 'data.frame': 2 obs. of 1 variable: > ## $ x:List of 2 > ## ..$ : > ## ..$ : > get("a", ldf$x[[1]]) > ## [1] 1 > get("c", ldf$x[[2]]) > ## [1] 3 > {code} > It is possible to create a Spark data frame: > {code} > sdf <- createDataFrame(sqlContext, ldf) > printSchema(sdf) > ## root > ## |-- x: array (nullable = true) > ## ||-- element: map (containsNull = true) > ## |||-- key: string > ## |||-- value: double (valueContainsNull = true) > {code} > but it throws: > {code} > java.lang.IllegalArgumentException: Invalid array type e > {code} > on collect / head. > Problem seems to be specific to environments and cannot be reproduced when > Map comes for example from Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11617) MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected
[ https://issues.apache.org/jira/browse/SPARK-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007315#comment-15007315 ] Marcelo Vanzin commented on SPARK-11617: BTW, I updated the PR with a test case that fails with the exceptions you saw if I disable the fix; they pass consistently with the fix applied. I also ran several jobs that do a lot of shuffles and didn't see any problems with the latest fix applied. > MEMORY LEAK: ByteBuf.release() was not called before it's garbage-collected > --- > > Key: SPARK-11617 > URL: https://issues.apache.org/jira/browse/SPARK-11617 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: LingZhou > > The problem may be related to > [SPARK-11235][NETWORK] Add ability to stream data using network lib. > while running on yarn-client mode, there are error messages: > 15/11/09 10:23:55 ERROR util.ResourceLeakDetector: LEAK: ByteBuf.release() > was not called before it's garbage-collected. Enable advanced leak reporting > to find out where the leak occurred. To enable advanced leak reporting, > specify the JVM option '-Dio.netty.leakDetectionLevel=advanced' or call > ResourceLeakDetector.setLevel() See > http://netty.io/wiki/reference-counted-objects.html for more information. > and then it will cause > cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN > for exceeding memory limits. 9.0 GB of 9 GB physical memory used. Consider > boosting spark.yarn.executor.memoryOverhead. > and WARN scheduler.TaskSetManager: Lost task 105.0 in stage 1.0 (TID 2616, > gsr489): java.lang.IndexOutOfBoundsException: index: 130828, length: 16833 > (expected: range(0, 524288)). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
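For context on the leak warning: Netty's ByteBuf is reference-counted, and every buffer handed off must eventually be balanced by a release(), or its memory is never returned to the pool. A tiny stand-in class (illustrative only, not the Netty API) shows the contract:

```python
# Sketch of the retain()/release() contract behind the ByteBuf leak warning:
# the buffer is reclaimed only when the count reaches zero, so a missing
# release() pins the memory forever.
class RefCounted:
    def __init__(self):
        self.ref_cnt = 1  # a freshly allocated buffer starts at 1

    def retain(self):
        if self.ref_cnt == 0:
            raise RuntimeError("retain on released buffer")
        self.ref_cnt += 1
        return self

    def release(self):
        if self.ref_cnt == 0:
            raise RuntimeError("double release")
        self.ref_cnt -= 1
        return self.ref_cnt == 0  # True when the buffer is deallocated

buf = RefCounted()
buf.retain()                    # handed to another component
assert buf.release() is False   # still referenced elsewhere
assert buf.release() is True    # last reference gone; memory reclaimed
```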
[jira] [Commented] (SPARK-9065) Add the ability to specify message handler function in python similar to Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-9065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007192#comment-15007192 ]

Apache Spark commented on SPARK-9065:
-------------------------------------

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9742

> Add the ability to specify message handler function in python similar to Scala/Java
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-9065
>                 URL: https://issues.apache.org/jira/browse/SPARK-9065
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark, Streaming
>            Reporter: Tathagata Das
>            Assignee: Saisai Shao
[jira] [Issue Comment Deleted] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node
[ https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li updated SPARK-11633:
----------------------------
    Comment: was deleted

(was: Which version are you using? I did hit an error, but it is a different error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'F1' given input columns keyCol1, keyCol2; line 1 pos 7

This is a self-join issue. I will try to investigate the root cause. Thanks!)

> HiveContext throws TreeNode Exception : Failed to Copy Node
> -----------------------------------------------------------
>
>                 Key: SPARK-11633
>                 URL: https://issues.apache.org/jira/browse/SPARK-11633
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.0, 1.5.1
>            Reporter: Saurabh Santhosh
>            Priority: Critical
>
> h2. HiveContext#sql is throwing the following exception in a specific scenario :
> h2. Exception :
> Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Failed to copy node.
> Is otherCopyArgs specified correctly for LogicalRDD.
> Exception message: wrong number of arguments
> ctor: public org.apache.spark.sql.execution.LogicalRDD(scala.collection.Seq,org.apache.spark.rdd.RDD,org.apache.spark.sql.SQLContext)?
> h2. Code :
> {code:title=SparkClient.java|borderStyle=solid}
> StructField[] fields = new StructField[2];
> fields[0] = new StructField("F1", DataTypes.StringType, true, Metadata.empty());
> fields[1] = new StructField("F2", DataTypes.StringType, true, Metadata.empty());
>
> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(RowFactory.create("", "")));
> DataFrame df = sparkHiveContext.createDataFrame(rdd, new StructType(fields));
> sparkHiveContext.registerDataFrameAsTable(df, "t1");
> DataFrame aliasedDf = sparkHiveContext.sql("select f1, F2 as F2 from t1");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t2");
> sparkHiveContext.registerDataFrameAsTable(aliasedDf, "t3");
> sparkHiveContext.sql("select a.F1 from t2 a inner join t3 b on a.F2=b.F2");
> {code}
> h2. Observations :
> * If F1 (the exact name of the field) is used instead of f1, the code works correctly.
> * If an alias is not used for F2, the code also works, irrespective of the case of F1.
> * If field F2 is not used in the final query, the code also works correctly.
[jira] [Reopened] (SPARK-9603) Re-enable complex R package test in SparkSubmitSuite
[ https://issues.apache.org/jira/browse/SPARK-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman reopened SPARK-9603:
------------------------------------------

We still have some failures on Jenkins, as reported in https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac

> Re-enable complex R package test in SparkSubmitSuite
> ----------------------------------------------------
>
>                 Key: SPARK-9603
>                 URL: https://issues.apache.org/jira/browse/SPARK-9603
>             Project: Spark
>          Issue Type: Test
>          Components: Deploy, SparkR, Tests
>    Affects Versions: 1.5.0
>            Reporter: Burak Yavuz
>            Assignee: Sun Rui
>             Fix For: 1.6.0
>
> For building complex Spark Packages that contain R code in addition to Scala, we have a complex procedure, where R source code is shipped inside a jar. The source code is extracted, built, and added as a library alongside SparkR. The end-to-end test in SparkSubmitSuite ("correctly builds R packages included in a jar with --packages") can't run on Jenkins now, because the pull request builder is not built with SparkR. Once the PR builder is built with SparkR, we should re-enable the test.
[jira] [Resolved] (SPARK-11742) Show batch failures in the Streaming UI landing page
[ https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das resolved SPARK-11742.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Show batch failures in the Streaming UI landing page
> ----------------------------------------------------
>
>                 Key: SPARK-11742
>                 URL: https://issues.apache.org/jira/browse/SPARK-11742
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
>             Fix For: 1.6.0
[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically
[ https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-11259:
--------------------------------------
    Target Version/s: 1.6.1, 1.7.0

> Params.validateParams() should be called automatically
> ------------------------------------------------------
>
>                 Key: SPARK-11259
>                 URL: https://issues.apache.org/jira/browse/SPARK-11259
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>
> Params.validateParams() is not called automatically at the moment, so the following code snippet does not throw an exception, which is not the expected behavior.
> {code}
> val df = sqlContext.createDataFrame(
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
>     (2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
>     (3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
>     (4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
> ).toDF("id", "features", "label")
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(10)
>   .setMax(0)
> val pipeline = new Pipeline().setStages(Array(scaler))
> pipeline.fit(df)
> {code}
> validateParams() should be called automatically by PipelineStage (Pipeline/Estimator/Transformer), so I propose to put it in transformSchema().
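The proposed behavior — transformSchema invoking validateParams so that an inconsistent stage fails fast at fit time — can be sketched in a few lines of plain Python. This is an illustrative sketch only; the class and method names below mimic, but are not, Spark's ML API.

```python
class PipelineStage:
    """Illustrative stand-in for Spark ML's PipelineStage."""

    def validate_params(self):
        pass  # subclasses override to check cross-parameter consistency

    def transform_schema(self, schema):
        # Proposed behavior: validation runs automatically before any
        # schema transformation, so a misconfigured stage fails at fit()
        # time instead of silently producing bad output.
        self.validate_params()
        return schema


class MinMaxScaler(PipelineStage):
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def validate_params(self):
        if self.lo >= self.hi:
            raise ValueError(f"min ({self.lo}) must be < max ({self.hi})")


# Same inconsistency as in the report: min=10, max=0.
stage = MinMaxScaler(lo=10, hi=0)
try:
    stage.transform_schema(["features"])
except ValueError as e:
    print("rejected:", e)
```

With validation hooked into transform_schema, the bad configuration is rejected before any data is touched, which is the outcome the snippet in the report currently fails to produce.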
[jira] [Resolved] (SPARK-11553) row.getInt(i) if row[i]=null returns 0
[ https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust resolved SPARK-11553.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

Issue resolved by pull request 9642
[https://github.com/apache/spark/pull/9642]

> row.getInt(i) if row[i]=null returns 0
> --------------------------------------
>
>                 Key: SPARK-11553
>                 URL: https://issues.apache.org/jira/browse/SPARK-11553
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Tofigh
>            Priority: Blocker
>             Fix For: 1.6.0
>
> Row.getInt|getFloat|getDouble in a Spark RDD return 0 if row[index] is null, even though according to the documentation they should throw a NullPointerException.
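The pitfall reported here — a primitive getter silently decoding null as 0 — can be reproduced without Spark. A minimal pure-Python sketch (the `Row` class and its method names are illustrative, not Spark's actual implementation) of why callers should test for null before using a primitive getter:

```python
class Row:
    """Illustrative stand-in for a SQL row backed by possibly-null slots."""

    def __init__(self, values):
        self._values = values

    def is_null_at(self, i):
        return self._values[i] is None

    def get_int(self, i):
        # Mirrors the reported behavior: a null slot silently decodes as 0
        # instead of raising, because the primitive has no null representation.
        v = self._values[i]
        return 0 if v is None else int(v)


row = Row([None, 42])
print(row.get_int(0))  # 0 -- indistinguishable from a genuine zero
print(row.get_int(1))  # 42

# The safe pattern: test for null first, then read the primitive.
value = None if row.is_null_at(0) else row.get_int(0)
print(value)  # None
```

The null-check-then-get pattern is what callers had to use while the getter returned 0; the fix in the linked pull request makes the getter itself fail loudly instead.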
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007640#comment-15007640 ]

Charles Allen commented on SPARK-11016:
---------------------------------------

[~davies] Was in a meeting, looks like you got it :)

> Spark fails when running with a task that requires a more recent version of RoaringBitmaps
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11016
>                 URL: https://issues.apache.org/jira/browse/SPARK-11016
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.4.0
>            Reporter: Charles Allen
>             Fix For: 1.6.0
>
> The following error appears during Kryo init whenever a more recent version (>0.5.0) of RoaringBitmap is required by a job; org/roaringbitmap/RoaringArray$Element was removed in 0.5.0.
> {code}
> A needed class was not found. This could be due to an error in your runpath. Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
> at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
> at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
> at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
> at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
> at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
> at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
> at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
> at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
> at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
> at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
> at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
> at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
> at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
> at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
> at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
> at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info
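One common mitigation for this kind of bundled-dependency conflict (not part of the report above, and whether it is appropriate depends on the deployment) is to let the application's jars win over Spark's copies via Spark's experimental class-loading options; shading RoaringBitmap inside the application jar is the more robust alternative:

```properties
# spark-defaults.conf -- prefer the application's RoaringBitmap over the
# version bundled with Spark. Note: userClassPathFirst is experimental and
# can itself cause classpath issues; shading is generally safer.
spark.executor.userClassPathFirst  true
spark.driver.userClassPathFirst    true
```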
[jira] [Assigned] (SPARK-11766) JSON serialization of Vectors
[ https://issues.apache.org/jira/browse/SPARK-11766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11766:
------------------------------------
    Assignee: Xiangrui Meng  (was: Apache Spark)

> JSON serialization of Vectors
> -----------------------------
>
>                 Key: SPARK-11766
>                 URL: https://issues.apache.org/jira/browse/SPARK-11766
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> We want to support JSON serialization of vectors in order to support SPARK-11764.
[jira] [Updated] (SPARK-11742) Show batch failures in the Streaming UI landing page
[ https://issues.apache.org/jira/browse/SPARK-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-11742:
----------------------------------
    Assignee: Shixiong Zhu

> Show batch failures in the Streaming UI landing page
> ----------------------------------------------------
>
>                 Key: SPARK-11742
>                 URL: https://issues.apache.org/jira/browse/SPARK-11742
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Shixiong Zhu
>            Assignee: Shixiong Zhu
[jira] [Commented] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15007604#comment-15007604 ]

Apache Spark commented on SPARK-11762:
--------------------------------------

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9747

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
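The direction of the fix described in the report — include active streams in the in-use count — can be sketched in plain Python. Names here are illustrative; the real class is Java's TransportResponseHandler in Spark's network library.

```python
class ResponseHandler:
    """Illustrative stand-in for TransportResponseHandler's bookkeeping."""

    def __init__(self):
        self.outstanding_fetches = {}  # stream_chunk_id -> callback
        self.outstanding_rpcs = {}     # request_id -> callback
        self.num_active_streams = 0    # incremented when a stream is opened

    def num_outstanding_requests(self):
        # Counting only fetches and RPCs (the reported bug) would let the
        # idle-timeout handler see 0 and close a channel mid-stream.
        # Including open streams keeps the channel marked as in use until
        # the stream transfer completes.
        return (len(self.outstanding_fetches)
                + len(self.outstanding_rpcs)
                + self.num_active_streams)


h = ResponseHandler()
h.num_active_streams += 1            # a stream transfer is in progress
print(h.num_outstanding_requests())  # 1 -> channel is considered in use
```

With the stream counted, a timeout firing during the transfer no longer mistakes the channel for idle, which is exactly the failure mode the report describes.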
[jira] [Assigned] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11762:
------------------------------------
    Assignee:     (was: Apache Spark)

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
[jira] [Assigned] (SPARK-11762) TransportResponseHandler should consider open streams when counting outstanding requests
[ https://issues.apache.org/jira/browse/SPARK-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11762:
------------------------------------
    Assignee: Apache Spark

> TransportResponseHandler should consider open streams when counting outstanding requests
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-11762
>                 URL: https://issues.apache.org/jira/browse/SPARK-11762
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.0
>            Reporter: Marcelo Vanzin
>            Assignee: Apache Spark
>            Priority: Minor
>
> This code in TransportResponseHandler:
> {code}
> public int numOutstandingRequests() {
>   return outstandingFetches.size() + outstandingRpcs.size();
> }
> {code}
> is used to determine whether the channel is currently in use; if there is a timeout and the channel is in use, the channel is closed. But it currently does not consider open streams (just block fetches and RPCs), so if a timeout happens during a stream transfer, the channel will remain open.
[jira] [Updated] (SPARK-11725) Let UDF to handle null value
[ https://issues.apache.org/jira/browse/SPARK-11725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-11725:
-------------------------------------
    Target Version/s: 1.6.0
            Priority: Blocker  (was: Major)
          Issue Type: Bug  (was: Improvement)

> Let UDF to handle null value
> ----------------------------
>
>                 Key: SPARK-11725
>                 URL: https://issues.apache.org/jira/browse/SPARK-11725
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Jeff Zhang
>            Priority: Blocker
>
> I notice that Spark currently treats a long field as -1 if it is null. Here's the sample code.
> {code}
> sqlContext.udf.register("f", (x:Int)=>x+1)
> df.withColumn("age2", expr("f(age)")).show()
> // Output
> +----+-------+----+
> | age|   name|age2|
> +----+-------+----+
> |null|Michael|   0|
> |  30|   Andy|  31|
> |  19| Justin|  20|
> +----+-------+----+
> {code}
> I think for the null value we have 3 options:
> * Use a special value to represent it (what Spark does now)
> * Always return null if the UDF input has a null value argument
> * Let the UDF itself handle null
> I would prefer the third option
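The three options listed in the report can be contrasted with ordinary Python functions. This is a sketch of the semantics only, not PySpark's actual UDF machinery; the wrapper names are invented for illustration.

```python
def f(x):
    """The user's UDF body: x + 1, written assuming a non-null input."""
    return x + 1

# Option 1: sentinel -- null is silently coerced to -1 before the UDF runs
# (the behavior described in the report, which turns null into 0 after +1).
def udf_sentinel(x):
    return f(-1 if x is None else x)

# Option 2: propagate -- null in, null out; the UDF body never sees null.
def udf_propagate(x):
    return None if x is None else f(x)

# Option 3: let the UDF handle null itself (the reporter's preference).
# The fallback value here is the UDF author's explicit, visible choice.
def udf_aware(x):
    return 0 if x is None else f(x)

ages = [None, 30, 19]
print([udf_sentinel(a) for a in ages])   # [0, 31, 20] -- null silently became 0
print([udf_propagate(a) for a in ages])  # [None, 31, 20]
print([udf_aware(a) for a in ages])      # [0, 31, 20] -- but 0 was chosen on purpose
```

Options 1 and 3 can produce the same numbers, but only option 3 makes the null handling an explicit decision in the UDF rather than a hidden coercion, which is why the report argues for it.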
[jira] [Created] (SPARK-11763) Refactoring to create template for Estimator, Model pairs
Joseph K. Bradley created SPARK-11763:
-------------------------------------

             Summary: Refactoring to create template for Estimator, Model pairs
                 Key: SPARK-11763
                 URL: https://issues.apache.org/jira/browse/SPARK-11763
             Project: Spark
          Issue Type: Sub-task
          Components: ML
            Reporter: Joseph K. Bradley
            Assignee: Joseph K. Bradley

Add save/load to the LogisticRegression Estimator, and refactor the tests a little to make it easier to add similar support to other Estimator, Model pairs.