[jira] [Updated] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12741:
-
Description:
Hi, I'm updating my report. I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and two methods: one collects data and the other counts it.
Method doQuery looks like:
{code}
dataFrame.collect()
{code}
Method doQueryCount looks like:
{code}
dataFrame.count()
{code}
I have a few scenarios, with these results:
1) No data exists in my NoSQL database: count 0 and collect 0.
2) 3 rows exist: count 0 and collect 3.
3) 5 rows exist: count 2 and collect 5.
I tried changing the count code to the code below, but got the same results as above.
{code}
dataFrame.sql("select count(*) from tbl").count/collect[0]
{code}
Thanks, Sasi

was: (the same description, but using subscribersDataFrame instead of dataFrame)

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
malouke created SPARK-12954:
---
Summary: pyspark API 1.3.0 how we can patitionning by columns
Key: SPARK-12954
URL: https://issues.apache.org/jira/browse/SPARK-12954
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.3.0
Environment: spark 1.3.0, Cloudera Manager, Linux platform, pyspark
Reporter: malouke
Priority: Blocker

Hi, before posting this question I tried a lot of things, but I found no solution. I have 9 tables and I join them in two ways:
1) First test with df.join(df2, df.id == df2.id2, 'left_outer')
2) sqlContext.sql("select * from t1 left join t2 on id_t1 = id_t2")
After that I want to partition the result of the join by date:
- In PySpark 1.5.2 I tried partitionBy; if the table does not come from a join of more than two tables, everything is OK, but when I join more than three tables I get no result after several hours.
- In PySpark 1.3.0 I could not find an API function that lets me partition by date columns.
Q: Can someone help me resolve this problem? Thank you in advance.
[jira] [Resolved] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12954.
---
Resolution: Invalid
Target Version/s: (was: 1.3.0)
[~Malouke] a lot is wrong with this. Please don't open a JIRA until you read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Blocker or Target Version. Ask questions on u...@spark.apache.org
[jira] [Commented] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110628#comment-15110628 ] Daniel Darabos commented on SPARK-2309:
---
https://github.com/apache/spark/blob/v1.6.0/docs/ml-classification-regression.md#logistic-regression still says:
> The current implementation of logistic regression in spark.ml only supports binary classes. Support for multiclass regression will be added in the future.
That can be removed now, right?

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: DB Tsai
> Assignee: DB Tsai
> Priority: Critical
> Fix For: 1.3.0
>
> Currently, there is no multi-class classifier in mllib. Logistic regression can be extended to a multinomial one straightforwardly.
> The following formula will be implemented.
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110523#comment-15110523 ] Sasi commented on SPARK-12741:
--
I changed the way I use the DataFrame since my last ticket. Now I have a DataFrame without any cache or persist operation, so each time I add/remove a row I see it in the DataFrame. The only problem I'm having right now is the count operation. I'll try to create a new DataFrame for each count call; maybe that will resolve the problem. By the way, I see the count result change, e.g. from 3 to 5, but the count is still not equal to the real DataFrame size. Thanks again! Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110553#comment-15110553 ] Sasi commented on SPARK-12741:
--
Creating a new DataFrame didn't resolve the issue. I still think it's a bug. Thanks, Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110584#comment-15110584 ] Sasi commented on SPARK-12741:
--
I checked my DB, which is Aerospike, and I got the same results as my collect. I'm creating my DataFrame with Aerospark, a connector written by Sasha: https://github.com/sasha-polev/aerospark/
I'm using the DataFrame actions as described in the SQL programming guide: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
I know there are two ways to run actions on a DataFrame:
1) The SQL way.
{code}
dataFrame.sqlContext().sql("select count(*) from tbl").collect()[0]
{code}
2) The DataFrame way.
{code}
dataFrame.where("...").count()
{code}
I'm using the DataFrame way, which is simpler to understand and to read as Java code.
[jira] [Updated] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12741:
-
Description:
Hi, I'm updating my report. I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and two methods: one collects data and the other counts it.
Method doQuery looks like:
{code}
subscribersDataFrame.collect()
{code}
Method doQueryCount looks like:
{code}
subscribersDataFrame.count()
{code}
I have a few scenarios, with these results:
1) No data exists in my NoSQL database: count 0 and collect 0.
2) 3 rows exist: count 0 and collect 3.
3) 5 rows exist: count 2 and collect 5.
I tried changing the count code to the code below, but got the same results as above.
{code}
subscribersDataFrame.sql("select count(*) from tbl").count/collect[0]
{code}
Thanks, Sasi

was: Hi, I noticed that the DataFrame count method always returns the wrong size. Assume I have 11 records. When running dataframe.count() I get 9. Also, if I run dataframe.collectAsList() I get 9 records instead of 11. But if I run dataframe.collect() I get 11. Thanks, Sasi
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110512#comment-15110512 ] dileep commented on SPARK-12843:
--
I will look into this issue.

> Spark should avoid scanning all partitions when limit is set
> --
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when the data is available at the beginning of the scan.
> This behaviour should be avoided; the scan should stop when enough data has been collected.
> Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110596#comment-15110596 ] malouke commented on SPARK-12954:
--
OK, sorry.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110563#comment-15110563 ] Sasi commented on SPARK-12741:
--
If I run the following code, the result is the same as the collect:
{code}
dataFrame.where("...").count()
{code}
[jira] [Commented] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110603#comment-15110603 ] malouke commented on SPARK-12954:
--
Hi Sean, where can I ask questions?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110572#comment-15110572 ] Sean Owen commented on SPARK-12741:
---
I can't reproduce this. I always get the same count and collect result on an example data set. Above, I think you mean sqlContext, not dataFrame, right? This makes me wonder what you're really executing. Also, couldn't it be an issue with your DB?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110503#comment-15110503 ] Sasi commented on SPARK-12741:
--
I updated the report; can you verify it again? Thanks! Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110613#comment-15110613 ] Sasi commented on SPARK-12741:
--
Additional update: if I use the following code, then I get the same length as the collect.
{code}
subscribersDataFrame.rdd().count();
{code}
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110511#comment-15110511 ] Sean Owen commented on SPARK-12741:
---
I recall from other JIRAs that you're not updating the DataFrame / RDD but updating the table? That's not going to cause the result to change -- or it may; it's undefined.
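The snapshot-versus-recomputation semantics described here can be illustrated without Spark at all. A plain-Python analogy (purely illustrative, no Spark API involved): a materialized copy of the data is frozen at the moment it was taken, while a query that is re-evaluated against the source sees later mutations, so mixing the two styles produces exactly the kind of "changing" counts reported in this thread.

```python
# A cached/collected result is a snapshot; re-running the query against the
# source sees later mutations. Mixing the two styles explains inconsistent counts.
table = [1, 2, 3]               # the external table
snapshot = list(table)          # like a cached collect(): a frozen copy
count_now = lambda: len(table)  # like re-running count() against the source

table.append(4)                 # the table changes after the snapshot was taken

assert len(snapshot) == 3       # the snapshot does not see the new row
assert count_now() == 4         # a fresh evaluation does
```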
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12247:
---
Assignee: Benjamin Fradet

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, MLlib
> Affects Versions: 1.5.2
> Reporter: Timothy Hunter
> Assignee: Benjamin Fradet
>
> We need to add a section in the documentation about collaborative filtering in the dataframe API:
> - copy explanations about collaborative filtering and ALS from spark.mllib
> - provide an example with spark.ml's ALS
[jira] [Assigned] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12953: Assignee: Apache Spark > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui >Assignee: Apache Spark > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at 
net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110301#comment-15110301 ] Apache Spark commented on SPARK-12953:
--
User 'shijinkui' has created a pull request for this issue: https://github.com/apache/spark/pull/10864
[jira] [Created] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
shijinkui created SPARK-12953: - Summary: RDDRelation write set mode will be better to avoid error "pair.parquet already exists" Key: SPARK-12953 URL: https://issues.apache.org/jira/browse/SPARK-12953 Project: Spark Issue Type: Wish Components: SQL Reporter: shijinkui Fix For: 1.6.1 An error occurs if the write mode is not set when executing the test case `RDDRelation.main()` Exception in thread "main" org.apache.spark.sql.AnalysisException: path file:/Users/sjk/pair.parquet already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
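The fix being requested is to pass an explicit save mode to the writer (in Spark's Scala API, `df.write.mode(SaveMode.Overwrite).parquet(path)`), so the write replaces existing output instead of throwing "path already exists". The semantics can be illustrated without Spark; the sketch below uses plain Java file I/O with a hypothetical `writeFile` helper and a `Mode` enum modelled on (but not taken from) Spark's `SaveMode` — an illustration of the behavior, not Spark's implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveModeSketch {
    // Modelled on Spark's SaveMode options (assumption: behavior, not Spark code).
    enum Mode { ERROR_IF_EXISTS, OVERWRITE, IGNORE }

    // Hypothetical helper: ERROR_IF_EXISTS throws when the target already exists
    // (the failure seen in the report), OVERWRITE replaces it, IGNORE skips it.
    static void writeFile(Path target, String data, Mode mode) throws IOException {
        if (Files.exists(target)) {
            switch (mode) {
                case ERROR_IF_EXISTS:
                    throw new IOException("path " + target + " already exists");
                case IGNORE:
                    return; // leave the existing output untouched
                case OVERWRITE:
                    break;  // fall through and replace it
            }
        }
        Files.write(target, data.getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempDirectory("savemode").resolve("pair.parquet");
        writeFile(p, "first", Mode.ERROR_IF_EXISTS);  // fresh path: write succeeds
        writeFile(p, "second", Mode.OVERWRITE);       // replaces the file
        writeFile(p, "third", Mode.IGNORE);           // no-op: content stays "second"
        System.out.println(new String(Files.readAllBytes(p)));
        try {
            writeFile(p, "fourth", Mode.ERROR_IF_EXISTS);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // "... already exists", as in the report
        }
    }
}
```

Running `RDDRelation.main()` twice fails for the same reason as the final `ERROR_IF_EXISTS` call here: an error-on-exists default plus pre-existing output.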
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110354#comment-15110354 ] Sasi commented on SPARK-12906: -- Looks like it was fixed in 1.5.2. Thanks! > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't see any bug report or documentation about it. > Thanks, > Sasi
[jira] [Resolved] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12906. --- Resolution: Duplicate > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't see any bug report or documentation about it. > Thanks, > Sasi
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12247: --- Affects Version/s: (was: 1.5.2) 2.0.0 > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Timothy Hunter >Assignee: Benjamin Fradet > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110299#comment-15110299 ] Felix Cheung commented on SPARK-6817: - Thanks for putting the doc together, [~sunrui]. In this design, how does one control the partitioning? For instance, suppose one would like to group a census data DataFrame by a certain column, say MetropolitanArea, and then pass it to R's kmeans to cluster residents within close-by geographical areas. In order for the R UDFs to be effective, in this and some other cases, one would need to make sure the data is partitioned appropriately, and that mapPartition would produce a local R data.frame (assuming it fits into memory) that has all the relevant data in it? > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark.
[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-12953: -- Component/s: (was: SQL) Examples > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at 
net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Assigned] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12953: Assignee: (was: Apache Spark) > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at 
net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110359#comment-15110359 ] Sun Rui commented on SPARK-6817: For dapply(), users can call repartition() to set an appropriate number of partitions before calling dapply(). For gapply(), the SQL conf "spark.sql.shuffle.partitions" could be used to tune the number of partitions after the shuffle. I am also hoping SPARK-9850 (Adaptive execution in Spark) could help. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark.
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep edited comment on SPARK-12843 at 1/21/16 2:57 PM: - It's a caching issue: while scanning the table we need to cache the records, so from the next scan onwards it won't take much time. was (Author: dileep): Its a caching issue, while scanning the table need to cache the records, so from mext onwards it wont take much time > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110786#comment-15110786 ] Sean Owen commented on SPARK-12932: --- OK, do you want to make a PR for that? > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code}
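The bean requirement described above — `Encoders.bean` needs a public no-arg constructor plus getter/setter pairs — can be checked without Spark using the standard `java.beans.Introspector`. A minimal sketch (the `isBeanCompliant` helper and this `Person` variant are illustrative, not Spark APIs):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;

public class BeanCheck {
    // A Person with BOTH a default constructor AND setters -- the shape
    // Encoders.bean(Person.class) expects, per the issue description.
    public static class Person implements Serializable {
        private String first;
        private int age;
        public Person() {}
        public String getFirst() { return first; }
        public void setFirst(String first) { this.first = first; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Hypothetical helper: true if the class has a public no-arg constructor
    // and every introspected property has both a getter and a setter.
    static boolean isBeanCompliant(Class<?> cls) {
        try {
            cls.getConstructor(); // throws if there is no public no-arg constructor
            for (PropertyDescriptor pd : Introspector.getBeanInfo(cls, Object.class)
                                                     .getPropertyDescriptors()) {
                if (pd.getReadMethod() == null || pd.getWriteMethod() == null) {
                    return false; // getter or setter missing for this property
                }
            }
            return true;
        } catch (NoSuchMethodException | IntrospectionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isBeanCompliant(Person.class)); // prints "true"
        System.out.println(isBeanCompliant(String.class)); // prints "false"
    }
}
```

A check like this is roughly what a more useful error message could report: which property is missing its getter or setter, rather than a bare "no encoder found".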
[jira] [Updated] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12932: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code}
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110688#comment-15110688 ] Emlyn Corrin commented on SPARK-9740: - How do you use FIRST/LAST from the Java API with ignoreNulls now? I can't find a way to specify ignoreNulls=true. > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive.
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 2:40 PM: Wait, is this what you mean? {{select count(*) ...}} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different; they're different things. That seems to be your problem. was (Author: srowen): Wait, is this what you mean? "select count(*) ..." returns 1 row, which contains a number, which is the number of rows matching the predicate. count() returns 1 because there is 1 row in the result set. collect() collects (an Array of) that number of rows. Those are different; they're different things. That seems to be your problem. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. 
> {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi
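Sean Owen's point above is that `count()` counts the rows of the result set it is called on, while the single row returned by `select count(*)` carries the count as a value. The distinction can be sketched without Spark, using plain Java lists as stand-ins for DataFrames (names are illustrative only):

```java
import java.util.Arrays;
import java.util.List;

public class CountVsCollect {
    public static void main(String[] args) {
        // Five "rows" standing in for the DataFrame's contents.
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");

        // dataFrame.count() and dataFrame.collect().length both describe the data:
        long count = rows.size();             // analogous to dataFrame.count()
        Object[] collected = rows.toArray();  // analogous to dataFrame.collect()
        System.out.println(count);            // prints 5
        System.out.println(collected.length); // prints 5

        // "select count(*)" yields ONE row whose single value is the count:
        List<Long> countQueryResult = Arrays.asList((long) rows.size());
        System.out.println(countQueryResult.size());  // 1 -- like .count() on that result
        System.out.println(countQueryResult.get(0));  // 5 -- like .collect()[0]
    }
}
```

So calling `.count()` on a `select count(*)` result always reports 1, regardless of how many rows the underlying table has; the reported count/collect mismatch on the raw DataFrame is a separate question about the data source.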
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep commented on SPARK-12843: Please see the code below. We need to make use of the DataFrame's caching mechanism. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data. > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep edited comment on SPARK-12843 at 1/21/16 3:08 PM: - When I verified with 2 lakh (200,000) records, I was able to measure the millisecond difference through the program, which clearly shows it is not scanning all the records. But you can improve query performance by using the DataFrame object's cache method; I can see a significant improvement in query performance. Please see the code snippet below. We need to make use of the DataFrame's caching mechanism. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data. was (Author: dileep): Please see the below Code snippet. We need to make use of caching mechanism of the data frame. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); Which is making significant improvement in the select query. So for subsequent select query it wont select the entire data > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Assigned] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12760: Assignee: Apache Spark > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Assignee: Apache Spark >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say?
[jira] [Assigned] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12760: Assignee: (was: Apache Spark) > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say?
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep commented on SPARK-12843: It's a caching issue: while scanning the table we need to cache the records, so from the next scan onwards it won't take much time. > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110743#comment-15110743 ] Tommy Yu commented on SPARK-10262: -- Hi Xiangrui Meng, I took a look at all the classes under ml.attribute; they are all developer APIs. The only non-developer API is "AttributeFactory", but it is private to this package as well. I think this task should be closed without any PR needed. Can you please take a look? Regards, Yu Wenpei. > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter >
[jira] [Commented] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110814#comment-15110814 ] Apache Spark commented on SPARK-12760: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/10866 > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
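The separation the ticket debates can be reproduced without Spark: when the summing runs in another OS process (as it does on an executor), the driver-side variable is never updated. A minimal Python sketch of that memory-space split follows; it is an analogy for the closure-capture behaviour, not Spark code, and the variable names are illustrative.

```python
from multiprocessing import Process

counter = 0

def add_all(values):
    # Runs in a child process: this mutates the CHILD's copy of
    # `counter`, just as an executor mutates its own copy of a
    # variable captured in a closure.
    global counter
    for v in values:
        counter += v

if __name__ == "__main__":
    p = Process(target=add_all, args=([1, 2, 3, 4],))
    p.start()
    p.join()
    # The parent's counter is still 0: the sum never came back.
    print(counter)  # 0
```

This mirrors the cluster-mode case; the open question in the ticket is whether `local` mode, running in one JVM, really behaves differently, or whether the closure is serialized and copied there too.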
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110701#comment-15110701 ] Sasi commented on SPARK-12741: -- That's not what I meant. I just gave an example for each case, the SQL way and the DataFrame way. I know that count() on select count(*) returns 1 row; that's why I wrote collect()[0], which gives back the value. As I said before: dataFrame.count() and dataFrame.where("...").count() return a wrong size compared with dataFrame.where("...").collect().length or dataFrame.collect().length. I'm pretty sure that count() doesn't work as expected. Sasi > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762 ] Sean Owen commented on SPARK-12741: --- OK, that's what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
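Sean's point is that the two sides can only be compared over the same predicate. The invariant being disputed can be sketched with plain Python lists standing in for a DataFrame (illustrative only; `rows` and `predicate` are made-up stand-ins, not Spark API):

```python
rows = [{"age": a} for a in (15, 22, 31, 17, 40)]
predicate = lambda r: r["age"] >= 18

# The DataFrame contract under test: count() over a filter must equal
# the length of collect() over the SAME filter.
filtered = [r for r in rows if predicate(r)]      # "collect" side
count = sum(1 for r in rows if predicate(r))      # "count" side

print(count, len(filtered))  # 3 3
```

If the reported mismatch compares `where("...").collect().length` against an unfiltered `count()`, the two sides differ by construction, which is the clarification Sean is asking for.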
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 3:26 PM: OK, that's different from what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. was (Author: srowen): OK, that's what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. 
> {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110704#comment-15110704 ] dileep commented on SPARK-12843: public class JavaSparkSQL { public static class Person implements Serializable { private String name; private int age; public String getName() { return name; } public void setName(String name) { this.name = name; } public int getAge() { return age; } public void setAge(int age) { this.age = age; } } public static void main(String[] args) throws Exception { long millis1 = System.currentTimeMillis() % 1000; SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL").setMaster("local[4]"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); SQLContext sqlContext = new SQLContext(ctx); // Load a text file and convert each line to a Java Bean. JavaRDD<Person> people = ctx.textFile("/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people_1.txt").map( new Function<String, Person>() { @Override public Person call(String line) { String[] parts = line.split(","); Person person = new Person(); person.setName(parts[0]); person.setAge(Integer.parseInt(parts[1].trim())); return person; } }); // Apply a schema to an RDD of Java Beans and register it as a table. DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class); schemaPeople.registerTempTable("people"); // SQL can be run over RDDs that have been registered as tables. //DataFrame teenagers = sqlContext.sql("SELECT age, name FROM people WHERE age >= 13 AND age <= 19"); //DataFrame teenagers = sqlContext.sql("SELECT * FROM people"); DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); // The results of SQL queries are DataFrames and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal.
List<People> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, People>() { @Override public People call(Row row) { long millis2 = System.currentTimeMillis() % 1000; People people = new People(); people.setAge(row.getInt(0)); people.setName(row.getString(1)); //System.out.println(people.toString()); return people; } }).collect(); long millis2 = System.currentTimeMillis() % 1000; long millis3 = millis2 - millis1; System.out.println("difference = " + String.valueOf(millis3)); /* for (String name: teenagerNames) { System.out.println("=>"+name); } */ /* System.out.println("=== Data source: Parquet File ==="); // DataFrames can be saved as parquet files, maintaining the schema information. schemaPeople.write().parquet("people.parquet"); // Read in the parquet file created above. // Parquet files are self-describing so the schema is preserved. // The result of loading a parquet file is also a DataFrame. DataFrame parquetFile = sqlContext.read().parquet("people.parquet"); //Parquet files can also be registered as tables and then used in SQL statements. parquetFile.registerTempTable("parquetFile"); DataFrame teenagers2 = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"); teenagerNames = teenagers2.toJavaRDD().map(new Function<Row, String>() { @Override public String call(Row row) { return "Name: " + row.getString(0); } }).collect(); for (String name: teenagerNames) { System.out.println(name); } System.out.println("=== Data source: JSON Dataset ==="); // A JSON dataset is pointed by path. // The path can be either a single text file or a directory storing text files. String path = "/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people.json"; // Create a DataFrame from the file(s) pointed by path DataFrame peopleFromJsonFile = sqlContext.read().json(path); // Because the schema of a JSON dataset is automatically inferred, to write queries, // it is better to take a look at what is the schema. peopleFromJsonFile.printSchema(); // The schema of people is ... // root // |-- age: IntegerType // |-- name: StringType // Register this DataFrame as a table. peopleFromJsonFile.registerTempTable("people"); // SQL statements can be run by using the sql methods provided by sqlContext. DataFrame teenagers3 = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19"); // The results of SQL queries are DataFrame and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal.
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dileep updated SPARK-12843: --- Comment: was deleted (was: the JavaSparkSQL timing example quoted in full in the earlier comment on SPARK-12843)
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110772#comment-15110772 ] Andy Grove commented on SPARK-12932: After reviewing the code for this, I think it is just a case of changing the error message text in `org.apache.spark.sql.catalyst.JavaTypeInference#extractorFor`. The error should be changed to: {code} s"Cannot infer type for Java class ${other.getName} because it is not bean-compliant" {code} > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. 
> The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
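The bean rules the report describes (a public no-arg constructor plus a setter for every getter) can be checked mechanically. Below is a rough Python stand-in for the kind of inference Spark performs via Java reflection; the `is_bean_compliant` helper and both classes are made up for illustration, and Python naming is used in place of Java conventions.

```python
def is_bean_compliant(cls):
    # Rule 1: instantiable with no arguments (the "default constructor").
    try:
        cls()
    except TypeError:
        return False
    # Rule 2: every getXxx must have a matching setXxx.
    getters = {n[3:] for n in dir(cls) if n.startswith("get") and len(n) > 3}
    setters = {n[3:] for n in dir(cls) if n.startswith("set") and len(n) > 3}
    return getters <= setters

class GetterOnlyPerson:
    def __init__(self, first="", age=0):
        self.first, self.age = first, age
    def getFirst(self):
        return self.first
    def getAge(self):
        return self.age

class BeanPerson(GetterOnlyPerson):
    def setFirst(self, v):
        self.first = v
    def setAge(self, v):
        self.age = v

print(is_bean_compliant(GetterOnlyPerson))  # False -- getters but no setters
print(is_bean_compliant(BeanPerson))        # True
```

A check like this, run before encoder inference, is what would let the error message name the offending class and say *why* it is not bean-compliant, as the proposed message text does.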
[jira] [Resolved] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12534. --- Resolution: Fixed Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/10491 > Document missing command line options to Spark properties mapping > - > > Key: SPARK-12534 > URL: https://issues.apache.org/jira/browse/SPARK-12534 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, YARN >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > > Several Spark properties equivalent to Spark submit command line options are > missing. > {quote} > The equivalent for spark-submit --num-executors should be > spark.executor.instances > When use in SparkConf? > http://spark.apache.org/docs/latest/running-on-yarn.html > Could you try setting that with sparkR.init()? > _ > From: Franc Carter> Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > Hi, > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > If I start sparkR with > sparkR --master yarn --num-executors 6 > then I get 6 executors > However, if start sparkR with > sparkR > followed by > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > then I only get 2 executors. > Can anyone point me in the direction of what I might doing wrong ? I need to > initialise this was so that rStudio can hook in to SparkR > thanks > -- > Franc > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
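The mapping the ticket asks to document can be captured as a simple lookup. The `--num-executors` to `spark.executor.instances` entry comes from the ticket itself; the helper and the other two entries are illustrative assumptions, not an exhaustive or authoritative list.

```python
# spark-submit CLI option -> equivalent Spark property.
# First entry is from the ticket; the rest are assumed standard mappings.
OPTION_TO_PROPERTY = {
    "--num-executors": "spark.executor.instances",
    "--executor-memory": "spark.executor.memory",
    "--executor-cores": "spark.executor.cores",
}

def property_for(option):
    """Return the Spark property equivalent to a spark-submit flag."""
    return OPTION_TO_PROPERTY[option]

print(property_for("--num-executors"))  # spark.executor.instances
```

This also explains the quoted SparkR confusion: `sparkEnvir=list(spark.num.executors='6')` sets a property Spark does not read, whereas `spark.executor.instances` is the documented equivalent of `--num-executors`.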
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep edited comment on SPARK-12843 at 1/21/16 2:59 PM: - It's a caching issue: while scanning the table, the DataFrame needs to be cached, so from the next scan onwards it won't take much time was (Author: dileep): Its a caching issue, while scanning the table need to cache the records, so from next onwards it wont take much time > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided: the scan should stop when enough data has been > collected. > Is it related to: [SPARK-9850] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110785#comment-15110785 ] Andy Grove commented on SPARK-12932: Here is a pull request to change the error message: https://github.com/apache/spark/pull/10865 > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. 
> The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12534: -- Assignee: Felix Cheung (was: Apache Spark) > Document missing command line options to Spark properties mapping > - > > Key: SPARK-12534 > URL: https://issues.apache.org/jira/browse/SPARK-12534 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, YARN >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Several Spark properties equivalent to Spark submit command line options are > missing. > {quote} > The equivalent for spark-submit --num-executors should be > spark.executor.instances > When use in SparkConf? > http://spark.apache.org/docs/latest/running-on-yarn.html > Could you try setting that with sparkR.init()? > _ > From: Franc Carter> Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > Hi, > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > If I start sparkR with > sparkR --master yarn --num-executors 6 > then I get 6 executors > However, if start sparkR with > sparkR > followed by > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > then I only get 2 executors. > Can anyone point me in the direction of what I might doing wrong ? I need to > initialise this was so that rStudio can hook in to SparkR > thanks > -- > Franc > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12932: Assignee: Apache Spark > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Assignee: Apache Spark >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12932: Assignee: (was: Apache Spark) > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. 
> {code:title=Example.java}
> public class JavaDatasetExample {
>     public static void main(String[] args) throws Exception {
>         SparkConf sparkConf = new SparkConf()
>             .setAppName("Example")
>             .setMaster("local[*]");
>         JavaSparkContext sc = new JavaSparkContext(sparkConf);
>         SQLContext sqlContext = new SQLContext(sc);
>         List<Person> people = ImmutableList.of(
>             new Person("Joe", "Bloggs", 21, "NY")
>         );
>         Dataset<Person> dataset = sqlContext.createDataset(people, Encoders.bean(Person.class));
> {code}
> {code:title=Person.java}
> class Person implements Serializable {
>     String first;
>     String last;
>     int age;
>     String state;
>
>     public Person() {
>     }
>
>     public Person(String first, String last, int age, String state) {
>         this.first = first;
>         this.last = last;
>         this.age = age;
>         this.state = state;
>     }
>
>     public String getFirst() {
>         return first;
>     }
>
>     public String getLast() {
>         return last;
>     }
>
>     public int getAge() {
>         return age;
>     }
>
>     public String getState() {
>         return state;
>     }
> }
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110796#comment-15110796 ] Apache Spark commented on SPARK-12932: -- User 'andygrove' has created a pull request for this issue: https://github.com/apache/spark/pull/10865
> Bad error message with trying to create Dataset from RDD of Java objects that
> are not bean-compliant
>
> Key: SPARK-12932
> URL: https://issues.apache.org/jira/browse/SPARK-12932
> Project: Spark
> Issue Type: Improvement
> Components: Java API
> Affects Versions: 1.6.0
> Environment: Ubuntu 15.10 / Java 8
> Reporter: Andy Grove
> Priority: Minor
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen commented on SPARK-12741: --- Wait, is this what you mean? "select count(*) ..." returns 1 row, which contains a number, which is the number of rows matching the predicate. count() returns 1 because there is 1 row in the result set. collect() collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 2:40 PM:
Wait, is this what you mean? {code} select count(*) ... {code} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
was (Author: srowen): Wait, is this what you mean? {{select count(*) ...}} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
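The distinction Sean describes can be mimicked without a cluster. The sketch below is plain Python, not Spark: it models the result set of {{select count(*)}} as a list of rows. Counting the rows of that result always gives 1; the actual table size is the value inside the single row.

```python
# Model of a result set: a list of rows, each row a tuple of column values.
table = [("a",), ("b",), ("c",), ("d",), ("e",)]  # 5 rows

# "select count(*) from tbl" produces ONE row holding the number 5.
count_star_result = [(len(table),)]

rows_in_result = len(count_star_result)  # analogous to DataFrame.count() on the aggregate
value_in_row = count_star_result[0][0]   # analogous to reading the value out of collect()[0]

print(rows_in_result)  # 1 - one row in the aggregate's result set
print(value_in_row)    # 5 - the actual number of rows in the table
```

So counting the aggregate query's result and reading the aggregate's value answer different questions, which matches the scenarios the reporter observed.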
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep edited comment on SPARK-12843 at 1/21/16 2:57 PM:
Please see the code snippet below. We need to make use of the DataFrame's caching mechanism:
{code}
DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1");
teenagers.cache();
{code}
This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data.
was (Author: dileep): Please see the above Code. We need to make use of caching mechanism of the data frame. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); Which is making significant improvement in the select query. So for subsequent select query it wont select the entire data
> Spark should avoid scanning all partitions when limit is set
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej BryĆski
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when the data is available at the
> beginning of the scan.
> This behaviour should be avoided and the scan should stop when enough data
> has been collected.
> Is it related to: [SPARK-9850] ?
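The behaviour the ticket asks for — stop reading partitions once the limit is satisfied — can be sketched with a lazy scan in plain Python (a stand-in for illustration, not Spark's executor model). The generator records which partitions were actually read; taking the first 3 rows never touches the third partition.

```python
import itertools

scanned = []

def scan_partitions(partitions):
    # Lazily yield rows partition by partition, recording each partition
    # index the moment that partition is actually read.
    for i, part in enumerate(partitions):
        scanned.append(i)
        yield from part

partitions = [[1, 2], [3, 4], [5, 6]]
first_three = list(itertools.islice(scan_partitions(partitions), 3))
print(first_three)  # [1, 2, 3]
print(scanned)      # [0, 1] - the third partition was never scanned
```

Because `islice` stops pulling from the generator after 3 rows, the scan ends early; an eager scan (building the full row list before applying the limit) would have read all three partitions.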
[jira] [Updated] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12534: -- Issue Type: Improvement (was: Bug)
> Document missing command line options to Spark properties mapping
> -
>
> Key: SPARK-12534
> URL: https://issues.apache.org/jira/browse/SPARK-12534
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Documentation, YARN
> Affects Versions: 1.5.2
> Reporter: Felix Cheung
> Assignee: Felix Cheung
> Priority: Minor
> Fix For: 2.0.0
>
> Several Spark properties equivalent to Spark submit command line options are
> missing.
> {quote}
> The equivalent for spark-submit --num-executors should be
> spark.executor.instances
> When used in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
> Could you try setting that with sparkR.init()?
> _
> From: Franc Carter
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To:
> Hi,
> I'm having trouble working out how to get the number of executors set when
> using sparkR.init().
> If I start sparkR with
> sparkR --master yarn --num-executors 6
> then I get 6 executors.
> However, if I start sparkR with
> sparkR
> followed by
> sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6'))
> then I only get 2 executors.
> Can anyone point me in the direction of what I might be doing wrong? I need
> to initialise it this way so that RStudio can hook in to SparkR.
> thanks
> --
> Franc
> {quote}
[jira] [Resolved] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context
[ https://issues.apache.org/jira/browse/SPARK-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5929. -- Resolution: Won't Fix
> Pyspark: Register a pip requirements file with spark_context
>
> Key: SPARK-5929
> URL: https://issues.apache.org/jira/browse/SPARK-5929
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: buckhx
> Priority: Minor
>
> I've been doing a lot of dependency work with shipping dependencies to
> workers, as it is non-trivial for me to have my workers include the proper
> dependencies in their own environments.
> To circumvent this, I added an addRequirementsFile() method that takes a pip
> requirements file, downloads the packages, repackages them to be registered
> with addPyFiles, and ships them to workers.
> Here is a comparison of what I've done on the Palantir fork
> https://github.com/buckheroux/spark/compare/palantir:master...master
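The reporter's addRequirementsFile() exists only in the linked fork, but its first step — reading a pip requirements file into a list of requirement specifiers before downloading and repackaging them for addPyFiles() — can be sketched in plain Python. `parse_requirements` here is a hypothetical helper, not part of PySpark or pip.

```python
def parse_requirements(lines):
    """Collect requirement specifiers, skipping blank lines and comments,
    the way a hypothetical addRequirementsFile() might before invoking
    pip to download and repackage each package for addPyFiles()."""
    reqs = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            reqs.append(line)
    return reqs

requirements_txt = """
# analytics deps
numpy==1.9.1
pandas>=0.15

requests
""".splitlines()

print(parse_requirements(requirements_txt))
# ['numpy==1.9.1', 'pandas>=0.15', 'requests']
```

Each resulting specifier would then be fed to pip to fetch an archive that can be shipped to workers with the existing `SparkContext.addPyFile` mechanism.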
[jira] [Resolved] (SPARK-5647) Output metrics do not show up for older hadoop versions (< 2.5)
[ https://issues.apache.org/jira/browse/SPARK-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5647. -- Resolution: Duplicate I think this is nearly moot for Spark 2.x, given that Hadoop support may get to 2.4+ or 2.5+ only. I also don't see movement on this in a year.
> Output metrics do not show up for older hadoop versions (< 2.5)
> ---
>
> Key: SPARK-5647
> URL: https://issues.apache.org/jira/browse/SPARK-5647
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Kostas Sakellis
> Priority: Critical
>
> Need to add output metrics for hadoop < 2.5.
[jira] [Resolved] (SPARK-4247) [SQL] use beeline execute "create table as" thriftserver is not use "hive" user ,but the new hdfs dir's owner is "hive"
[ https://issues.apache.org/jira/browse/SPARK-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4247. -- Resolution: Not A Problem I think this is stale anyway, but I think this is a question about Hive and how it sets up directories and permissions. > [SQL] use beeline execute "create table as" thriftserver is not use "hive" > user ,but the new hdfs dir's owner is "hive" > --- > > Key: SPARK-4247 > URL: https://issues.apache.org/jira/browse/SPARK-4247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 > Environment: java: 1.7 > hadoop: 2.3.0-cdh5.0.0 > spark: branch-1.1 lastest > thriftserver with hive 0.12 > hive :0.13.1 > compile cmd: > sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests > -Dhadoop.version=2.3.0-cdh5.0.0 -Phive >Reporter: qiaohaijun > > thriftserver start cmd: > sudo -u ultraman sh start-thriftserver.sh > --- > beeline start cmd: > sh beeline -u jdbc:hive2://x.x.x.x:1 -n ultraman -p ** > --- > sql: > create table ultraman_tmp.test as select channel, subchannel from > custom.common_pc_pv where logdate >= '2014110210' and logdate <= '2014110210' > limit 10; > the hdfs dir is follow: > drwxr-xr-x - hive hdfs 0 2014-11-03 18:02 > /user/hive/warehouse/ultraman_tmp.db/test > > 2014-11-03 20:12:52,498 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(244)) - Trying to connect to metastore with > URI http://10.141.77.221:9083 > 2014-11-03 20:12:52,509 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(322)) - Waiting 1 seconds before next > connection attempt. > 2014-11-03 20:12:53,510 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(332)) - Connected to metastore. 
> 2014-11-03 20:12:53,899 ERROR [pool-10-thread-3] > server.SparkSQLOperationManager (Logging.scala:logError(96)) - Error > executing query: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > sourceviewfs://nsX/user/hive/datadir-tmp/hive_2014-11-03_20-12-43_561_4822588651544736505-2/-ext-1/_SUCCESS > to destination > /user/hive/warehouse/ultraman_tmp.db/litao_sparksql_test_9/_SUCCESS > at org.apache.hadoop.hive.ql.metadata.Hive.renameFile(Hive.java:2173) > at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2227) > at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:652) > at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1443) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result$lzycompute(InsertIntoHiveTable.scala:243) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result(InsertIntoHiveTable.scala:171) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:162) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) > at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:103) > at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) > at > org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:172) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175) > at > org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207) > at > 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58) > at > org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526) > at > org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55) > at >
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110906#comment-15110906 ] Jose Martinez Poblete commented on SPARK-12941: --- Thanks, let us know if this can be worked out on 1.4
> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
> Reporter: Jose Martinez Poblete
>
> When exporting data from Spark to Oracle, string datatypes are translated to
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
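Spark's real fix lives in its Scala JDBC dialect layer, but the dialect-override pattern itself is simple. Below is a plain-Python stand-in (the names are illustrative, not Spark's API): a generic mapping emits TEXT for strings, and an Oracle dialect overrides just that entry with a VARCHAR2 type Oracle accepts.

```python
# Hypothetical stand-in for a JDBC dialect's type mapping; not Spark's real API.
GENERIC_DDL = {
    "StringType": "TEXT",      # rejected by Oracle (ORA-00902)
    "IntegerType": "INTEGER",
}

ORACLE_OVERRIDES = {
    "StringType": "VARCHAR2(255)",  # a string type Oracle does accept
}

def ddl_type(catalyst_type, overrides=None):
    """Return the DDL type name, letting a dialect override the generic mapping."""
    overrides = overrides or {}
    return overrides.get(catalyst_type, GENERIC_DDL[catalyst_type])

print(ddl_type("StringType"))                    # TEXT
print(ddl_type("StringType", ORACLE_OVERRIDES))  # VARCHAR2(255)
```

The design point is that only types the target database disagrees on need an override; everything else falls through to the generic mapping.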
[jira] [Created] (SPARK-12956) add spark.yarn.hdfs.home.directory property
PJ Fanning created SPARK-12956: -- Summary: add spark.yarn.hdfs.home.directory property Key: SPARK-12956 URL: https://issues.apache.org/jira/browse/SPARK-12956 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.6.0 Reporter: PJ Fanning
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala uses the default home directory based on the Hadoop configuration. I have a use case where it would be useful to override this and to provide an explicit base path. If this seems like a generally useful config property, I can put together a pull request.
[jira] [Updated] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12946: -- Target Version/s: (was: 1.6.1) Fix Version/s: (was: 1.6.1)
> The SQL page is empty
> -
>
> Key: SPARK-12946
> URL: https://issues.apache.org/jira/browse/SPARK-12946
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.6.0
> Reporter: KaiXinXIaoLei
> Attachments: SQLpage.png
>
> I run a SQL query using "bin/spark-sql --master yarn". Then I open the UI,
> and find the SQL page is empty.
[jira] [Resolved] (SPARK-6137) G-Means clustering algorithm implementation
[ https://issues.apache.org/jira/browse/SPARK-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6137. -- Resolution: Won't Fix
> G-Means clustering algorithm implementation
> ---
>
> Key: SPARK-6137
> URL: https://issues.apache.org/jira/browse/SPARK-6137
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Denis Dus
> Priority: Minor
> Labels: clustering
>
> Would it be useful to implement the G-Means clustering algorithm based on
> K-Means?
> G-means is a powerful extension of k-means, which uses a test of cluster data
> normality to decide whether it is necessary to split the current cluster into
> two new ones.
> Its relative complexity (compared to k-Means) is O(K), where K is the maximum
> number of clusters.
> The original paper is by Greg Hamerly and Charles Elkan from University of
> California:
> [http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf]
> I also have a small prototype of this algorithm written in R (if anyone is
> interested in it).
[jira] [Resolved] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6056. -- Resolution: Not A Problem
> Unlimit offHeap memory use cause RM killing the container
> -
>
> Key: SPARK-6056
> URL: https://issues.apache.org/jira/browse/SPARK-6056
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.1
> Reporter: SaintBacchus
>
> Whether or not you set `preferDirectBufs` or limit the number of threads,
> Spark cannot limit the use of offheap memory.
> At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final,
> Netty allocates an offheap memory buffer of the same size as the heap buffer.
> So however many buffers you want to transfer, the same amount of offheap
> memory will be allocated.
> But once the allocated memory size reaches the overhead memory limit set in
> YARN, this executor will be killed.
> I wrote a simple code to test it:
> {code:title=test.scala|borderStyle=solid}
> import org.apache.spark.storage._
> import org.apache.spark._
> val bufferRdd = sc.makeRDD(0 to 10, 10).map(x=>new Array[Byte](10*1024*1024)).persist
> bufferRdd.count
> val part = bufferRdd.partitions(0)
> val sparkEnv = SparkEnv.get
> val blockMgr = sparkEnv.blockManager
> def test = {
>     val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
>     val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
>     val len = resultIt.map(_.length).sum
>     println(s"[${Thread.currentThread.getId}] get block length = $len")
> }
> def test_driver(count:Int, parallel:Int)(f: => Unit) = {
>     val tpool = new scala.concurrent.forkjoin.ForkJoinPool(parallel)
>     val taskSupport = new scala.collection.parallel.ForkJoinTaskSupport(tpool)
>     val parseq = (1 to count).par
>     parseq.tasksupport = taskSupport
>     parseq.foreach(x=>f)
>     tpool.shutdown
>     tpool.awaitTermination(100, java.util.concurrent.TimeUnit.SECONDS)
> }
> {code}
> progress:
> 1.
bin/spark-shell --master yarn-client --executor-cores 40 --num-executors 1
> 2. :load test.scala in spark-shell
> 3. use this command to watch the executor on the slave node
> {code}
> pid=$(jps|grep CoarseGrainedExecutorBackend |awk '{print $1}');top -b -p $pid|grep $pid
> {code}
> 4. test_driver(20,100)(test) in spark-shell
> 5. watch the output of the command on the slave node
> If multiple threads are used to get len, the physical memory will soon exceed
> the limit set by spark.yarn.executor.memoryOverhead
[jira] [Commented] (SPARK-4878) driverPropsFetcher causes spurious Akka disassociate errors
[ https://issues.apache.org/jira/browse/SPARK-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110865#comment-15110865 ] Sean Owen commented on SPARK-4878: -- I think this may be defunct anyway, but the code in question does still exist. I don't see these errors in the logs of tests, for example.
> driverPropsFetcher causes spurious Akka disassociate errors
> ---
>
> Key: SPARK-4878
> URL: https://issues.apache.org/jira/browse/SPARK-4878
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Stephen Haberman
> Priority: Minor
>
> The dedicated Akka system for fetching driver properties seems fine, but it
> leads to very misleading "AssociationHandle$Disassociated", dead letter, etc.
> sorts of messages that can lead the user to believe something is wrong with
> the cluster.
> (E.g. personally I thought it was a Spark -rc1/-rc2 bug and spent awhile
> poking around until I saw in the code that driverPropsFetcher is
> purposefully/immediately shut down.)
> Is there any way to cleanly shut down that initial Akka system so that the
> driver doesn't log these errors?
[jira] [Commented] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110908#comment-15110908 ] Apache Spark commented on SPARK-12760: -- User 'mortada' has created a pull request for this issue: https://github.com/apache/spark/pull/10867
> inaccurate description for difference between local vs cluster mode in
> closure handling
> ---
>
> Key: SPARK-12760
> URL: https://issues.apache.org/jira/browse/SPARK-12760
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Reporter: Mortada Mehyar
> Priority: Minor
>
> In the spark documentation there's an example for illustrating how `local`
> and `cluster` mode can differ
> http://spark.apache.org/docs/latest/programming-guide.html#example
> " In local mode with a single JVM, the above code will sum the values within
> the RDD and store it in counter. This is because both the RDD and the
> variable counter are in the same memory space on the driver node."
> However the above doesn't seem to be true. Even in `local` mode it seems like
> the counter value should still be 0, because the variable will be summed up
> in the executor memory space, but the final value in the driver memory space
> is still 0. I tested this snippet and verified that in `local` mode the value
> is indeed still 0.
> Is the doc wrong or perhaps I'm missing something the doc is trying to say?
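The crux of the doc issue — whether the driver's counter is mutated depends on whether the closure runs against the driver's own variable or a serialized copy — can be illustrated without Spark. In the plain-Python sketch below (a stand-in for executor behaviour, not Spark itself), the captured state is round-tripped through pickle the way a shipped closure's captured variables would be, so the task increments a copy and the driver's counter stays 0.

```python
import pickle

counter = 0
data = [1, 2, 3, 4, 5]

def run_task(task, captured):
    # Emulate shipping a closure to an executor: the captured state is
    # serialized and deserialized, so the task mutates a *copy* of it.
    state = pickle.loads(pickle.dumps(captured))
    for x in task:
        state["counter"] += x
    return state["counter"]

executor_result = run_task(data, {"counter": counter})
print(counter)          # 0  - the "driver" variable was never touched
print(executor_result)  # 15 - the sum accumulated inside the "executor" copy
```

This mirrors why relying on closure-captured mutable variables is unsafe in Spark regardless of mode, and why accumulators exist for this purpose.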
[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110964#comment-15110964 ] Sean Owen commented on SPARK-12650: --- [~vanzin] is that the intended way to set this? if so it sounds like that's the resolution.
> No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
> -
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
> Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a spark job.
> Issue-
> If I use yarn-cluster, then the SparkSubmit does not Xmx itself at all,
> leading to the jvm taking a default heap which is relatively large. This
> causes a large amount of vmem to be taken, so that it is killed by yarn. This
> can be worked around by disabling Yarn's vmem check, but that is a hack.
> If I run it in yarn-client mode, it's fine as long as my container has enough
> space for the driver, which is manageable. But I feel that the utter lack of
> Xmx settings for what I believe is a very small jvm is a problem.
> I believe this was introduced with the fix for SPARK-3884
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110836#comment-15110836 ] Maciej BryĆski commented on SPARK-12843: [~dileep] I think you miss the point of this Jira.
> Spark should avoid scanning all partitions when limit is set
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej BryĆski
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110853#comment-15110853 ] Sean Owen commented on SPARK-5629: -- Are all of the EC2 tickets becoming essentially "wont fix" as the support moves out of Spark itself? > Add spark-ec2 action to return info about an existing cluster > - > > Key: SPARK-5629 > URL: https://issues.apache.org/jira/browse/SPARK-5629 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > You can launch multiple clusters using spark-ec2. At some point, you might > just want to get some information about an existing cluster. > Use cases include: > * Wanting to check something about your cluster in the EC2 web console. > * Wanting to feed information about your cluster to another tool (e.g. as > described in [SPARK-5627]). > So, in addition to the [existing > actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: > * {{launch}} > * {{destroy}} > * {{login}} > * {{stop}} > * {{start}} > * {{get-master}} > * {{reboot-slaves}} > We add a new action, {{describe}}, which describes an existing cluster if > given a cluster name, and all clusters if not. > Some examples: > {code} > # describes all clusters launched by spark-ec2 > spark-ec2 describe > {code} > {code} > # describes cluster-1 > spark-ec2 describe cluster-1 > {code} > In combination with the proposal in [SPARK-5627]: > {code} > # describes cluster-3 in a machine-readable way (e.g. 
JSON) > spark-ec2 describe cluster-3 --machine-readable > {code} > Parallels in similar tools include: > * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju > * [{{starcluster > listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] > from MIT StarCluster
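A minimal sketch of what a {{describe}} action could do, assuming (as spark-ec2 does for its other actions) that cluster membership is encoded in security-group names like `cluster-1-master` / `cluster-1-slaves`. The helper name and data shapes here are hypothetical, not the actual spark_ec2.py code:

```python
def describe_clusters(instances, cluster=None):
    """Group EC2 instances into spark-ec2 clusters using the
    '<cluster>-master' / '<cluster>-slaves' security-group convention.
    With no cluster name given, describe every cluster found."""
    clusters = {}
    for inst in instances:
        for group in inst["groups"]:
            for suffix in ("-master", "-slaves"):
                if group.endswith(suffix):
                    name = group[: -len(suffix)]
                    role = suffix.lstrip("-")
                    clusters.setdefault(name, {}).setdefault(role, []).append(inst["id"])
    if cluster is not None:
        return {cluster: clusters.get(cluster, {})}
    return clusters

# Toy instance records standing in for an EC2 API response.
instances = [
    {"id": "i-1", "groups": ["cluster-1-master"]},
    {"id": "i-2", "groups": ["cluster-1-slaves"]},
    {"id": "i-3", "groups": ["cluster-2-master"]},
]
```

`describe_clusters(instances)` would list both clusters with their master and slave instances; `describe_clusters(instances, "cluster-2")` restricts the output to one cluster, matching the proposed CLI.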
[jira] [Resolved] (SPARK-6009) IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
[ https://issues.apache.org/jira/browse/SPARK-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6009. -- Resolution: Duplicate > IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND () > > > Key: SPARK-6009 > URL: https://issues.apache.org/jira/browse/SPARK-6009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.0 > Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0 > java version "1.7.0_75" > OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13) > OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) >Reporter: Paul Barber > > Running the following SparkSQL query over JDBC: > {noformat} >SELECT * > FROM FAA > WHERE Year >= 1998 AND Year <= 1999 > ORDER BY RAND () LIMIT 10 > {noformat} > This results in one or more workers throwing the following exception, with > variations for {{mergeLo}} and {{mergeHi}}. > {noformat} > :java.lang.IllegalArgumentException: Comparison method violates its > general contract! 
> - at java.util.TimSort.mergeHi(TimSort.java:868) > - at java.util.TimSort.mergeAt(TimSort.java:485) > - at java.util.TimSort.mergeCollapse(TimSort.java:410) > - at java.util.TimSort.sort(TimSort.java:214) > - at java.util.Arrays.sort(Arrays.java:727) > - at > org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708) > - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138) > - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135) > - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > - at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > - at org.apache.spark.scheduler.Task.run(Task.scala:56) > - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > - at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > - at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > - at java.lang.Thread.run(Thread.java:745) > {noformat} > We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same > error in both. The query sometimes succeeds, but fails more often than not. > Whilst this sounds similar to bugs 3032 and 3656, we believe it it is not the > same. > The {{ORDER BY RAND ()}} is using TimSort to produce the random ordering by > sorting a list of random values. 
Having spent some time looking at the issue > with jdb, it appears that the problem is triggered by the random values being > changed during the sort - the code which triggers this is in > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}} > - class RowOrdering, function compare, line 250 - where a new random number > is taken for the same row.
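The root cause described above (a sort key that changes between comparisons) and its usual fix can be illustrated outside Spark. This is a hedged Python sketch of the fix pattern, not Spark's actual RowOrdering code: materialize the random value once per row, then sort on that stable key, so the comparator can never violate TimSort's contract:

```python
import random

rows = [{"Year": 1998 + (i % 2), "id": i} for i in range(1000)]

# Broken pattern (what the report describes): drawing a fresh random
# number every time two rows are compared means the same row can order
# differently at different points in the sort -- Java's TimSort detects
# this and throws "Comparison method violates its general contract!".
#
# Fix pattern: draw the random key ONCE per row, then sort on the
# materialized, stable key.
keyed = [(random.random(), row) for row in rows]
keyed.sort(key=lambda kr: kr[0])
top10 = [row for _, row in keyed[:10]]   # ORDER BY RAND() LIMIT 10
```

Because each row's key is fixed before the sort begins, the ordering is a total order for the duration of the sort, which is exactly the contract TimSort requires.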
[jira] [Resolved] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4171. -- Resolution: Won't Fix I think this is obsolete now that the Akka actor bits are being removed. > StreamingContext.actorStream throws serializationError > -- > > Key: SPARK-4171 > URL: https://issues.apache.org/jira/browse/SPARK-4171 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0, 1.2.0 >Reporter: Shiti Saxena > > I encountered this issue when I was working on > https://issues.apache.org/jira/browse/SPARK-3872. > Running the following test case on v1.1.0 and the master > branch(v1.2.0-SNAPSHOT) throws a serialization error. > {noformat} > test("actor input stream") { > // Set up the streaming context and input streams > val ssc = new StreamingContext(conf, batchDuration) > val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor", > // Had to pass the local value of port to prevent from closing over > entire scope > StorageLevel.MEMORY_AND_DISK) > println("created actor") > networkStream.print() > ssc.start() > Thread.sleep(3 * 1000) > println("started stream") > Thread.sleep(3*1000) > logInfo("Stopping server") > logInfo("Stopping context") > ssc.stop() > } > {noformat} > where EchoActor is defined as > {noformat} > class EchoActor extends Actor with ActorHelper { > override def receive = { > case message => sender ! message > } > } > object EchoActor { > def props: Props = Props(new EchoActor()) > } > {noformat} > The same code works with v1.0.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12955) Spark-HiveSQL: It fails when querying a nested structure
Gerardo Villarroel created SPARK-12955: -- Summary: Spark-HiveSQL: It fails when querying a nested structure Key: SPARK-12955 URL: https://issues.apache.org/jira/browse/SPARK-12955 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Environment: CDH-5.4.7-1.cdh5.4.7 Reporter: Gerardo Villarroel If you create a Hive table whose schema contains a nested structure and query it with Spark, the query fails. You can reproduce it with a structure similar to: { a : { a : { aa1: "subValue", aa2: 1 } } , b : "value" } An extended example is below: { "namespace": "AA", "type": "record", "name": "CreditRecord", "fields": [ { "doc": "string", "type": "string", "name": "aaa" }, { "doc": "string", "type": "string", "name": "bbb" }, { "doc": "string", "type": "string", "name": "l90" }, { "doc": "boolean, ", "type": ["boolean", "null" ], "name": "isSubjectDeceased" }, { "doc": "string", "type": "string", "name": "gender" }, { "doc": "string", "type": "string", "name": "dateOfBirth" }, { "doc": "array of Trade, ", "type": [ { "items": { "fields": [ { "doc": "int, ", "type": [ "int", "null" ], "name": "sequenceNumber" }, { "doc": "string", "type": "string", "name": "tradeKey" }, { "doc": "PHR,", "type": [ { "fields": [ { "doc": "string", "type": "string", "name": "rate" }, { "doc": "string", "type": "string", "name": "date" } ], "type": "record", "name": "PHR" }, "null" ], "name": "phrgt48DR" } ], "type": "record", "name": "Trade" }, "type": "array" }, "null" ], "name": "trades" } ] }
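For reference, the lookup such a query has to perform over a nested schema can be sketched in plain Python. This is illustrative only; `resolve_path` is a hypothetical helper, not Spark or Hive code:

```python
def resolve_path(record, path):
    """Resolve a dotted field path such as "a.a.aa1" against a nested
    record -- the resolution a query like `SELECT a.a.aa1 FROM tbl`
    performs. Returns None for a missing field instead of failing."""
    cur = record
    for field in path.split("."):
        if not isinstance(cur, dict) or field not in cur:
            return None
        cur = cur[field]
    return cur

# The structure from the report above.
row = {"a": {"a": {"aa1": "subValue", "aa2": 1}}, "b": "value"}
value = resolve_path(row, "a.a.aa1")   # "subValue"
```

The bug report is about exactly this step failing inside Spark-HiveSQL for doubly nested records, even though each level of the path is present in the schema.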
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111056#comment-15111056 ] Shivaram Venkataraman commented on SPARK-5629: -- Yes - though I think it's beneficial to see if the ticket is still valid and if it is, we can open a corresponding issue at github.com/amplab/spark-ec2. Then we can leave a marker here saying where this issue is being followed up. > Add spark-ec2 action to return info about an existing cluster > - > > Key: SPARK-5629 > URL: https://issues.apache.org/jira/browse/SPARK-5629 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > You can launch multiple clusters using spark-ec2. At some point, you might > just want to get some information about an existing cluster. > Use cases include: > * Wanting to check something about your cluster in the EC2 web console. > * Wanting to feed information about your cluster to another tool (e.g. as > described in [SPARK-5627]). > So, in addition to the [existing > actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: > * {{launch}} > * {{destroy}} > * {{login}} > * {{stop}} > * {{start}} > * {{get-master}} > * {{reboot-slaves}} > We add a new action, {{describe}}, which describes an existing cluster if > given a cluster name, and all clusters if not. > Some examples: > {code} > # describes all clusters launched by spark-ec2 > spark-ec2 describe > {code} > {code} > # describes cluster-1 > spark-ec2 describe cluster-1 > {code} > In combination with the proposal in [SPARK-5627]: > {code} > # describes cluster-3 in a machine-readable way (e.g.
JSON) > spark-ec2 describe cluster-3 --machine-readable > {code} > Parallels in similar tools include: > * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju > * [{{starcluster > listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] > from MIT StarCluster
[jira] [Commented] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110979#comment-15110979 ] Apache Spark commented on SPARK-1680: - User 'weineran' has created a pull request for this issue: https://github.com/apache/spark/pull/10869 > Clean up use of setExecutorEnvs in SparkConf > - > > Key: SPARK-1680 > URL: https://issues.apache.org/jira/browse/SPARK-1680 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Thomas Graves >Priority: Blocker > Fix For: 1.1.0 > > > We should make this consistent between YARN and Standalone. Basically, YARN > mode should just use the executorEnvs from the Spark conf and not need > SPARK_YARN_USER_ENV.
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110994#comment-15110994 ] Mario Briggs commented on SPARK-12177: -- bq. If one uses the kafka v9 jar even when using the old consumer API, it can only work with a Kafka v9 broker. I tried it on a single system setup (v0.9 client talking to v0.8 server-side) and the consumers had a problem (old or new). The producers though worked fine. So you are right. So then we will have kafka-assembly and kafka-assembly-v09/new, each including their version of the kafka jars, right? (I guess you were all along thinking of 2 different assemblies, and I guessed the other way round. Duh, IRC might have been faster.) With the above confirmed, it automatically throws out 'The public API signatures (of KafkaUtils in the v0.9 subproject) are different and do not clash (with KafkaUtils in the original kafka subproject) and hence can be added to the existing (original kafka subproject) KafkaUtils class.' So the only thing left, it seems, is to use 'new' or a better term instead of 'v09', since we both agree on that. Great, and thanks Mark. How's the 'python/pyspark/streaming/kafka-v09(new).py' going? > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API in separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for more backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.
[jira] [Created] (SPARK-12957) Derive and propagate data constraints in logical plan
Yin Huai created SPARK-12957: Summary: Derive and propagate data constraints in logical plan Key: SPARK-12957 URL: https://issues.apache.org/jira/browse/SPARK-12957 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yin Huai Based on the semantics of a query plan, we can derive data constraints (e.g. if a filter defines {{a > 10}}, we know that the output data of this filter satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a framework to derive and propagate constraints in the logical plan, which can help us build more advanced optimizations.
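A toy sketch of the idea (hypothetical helpers, not the eventual Spark framework): derive constraints from a filter's predicate, then propagate them up through a projection that preserves only some columns:

```python
def derive_constraints(column, op, literal):
    """A comparison passes only for non-null input, so a filter on
    `a > 10` implies both `a > 10` and `a IS NOT NULL` for its output
    rows -- the example from the ticket."""
    return {(column, op, literal), (column, "IS NOT NULL", None)}

def propagate(constraints, output_columns):
    """Propagate constraints up through a projection: keep only the
    constraints whose column survives into the operator's output."""
    return {c for c in constraints if c[0] in output_columns}

filter_constraints = derive_constraints("a", ">", 10)
project_constraints = propagate(filter_constraints, {"a", "b"})
```

An optimizer holding `project_constraints` could, for example, drop a redundant `a IS NOT NULL` filter above the projection, which is the kind of "more advanced optimization" the ticket has in mind.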
[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111024#comment-15111024 ] Marcelo Vanzin commented on SPARK-12650: If it works it's a workaround; that's a pretty obscure legacy deprecated option, and we should have a more explicit alternative to it. > No means to specify Xmx settings for SparkSubmit in yarn-cluster mode > - > > Key: SPARK-12650 > URL: https://issues.apache.org/jira/browse/SPARK-12650 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.2 > Environment: Hadoop 2.6.0 >Reporter: John Vines > > Background- > I have an app master designed to do some work and then launch a spark job. > Issue- > If I use yarn-cluster, then the SparkSubmit does not Xmx itself at all, > leading to the jvm taking a default heap which is relatively large. This > causes a large amount of vmem to be taken, so that it is killed by yarn. This > can be worked around by disabling Yarn's vmem check, but that is a hack. > If I run it in yarn-client mode, it's fine as long as my container has enough > space for the driver, which is manageable. But I feel that the utter lack of > Xmx settings for what I believe is a very small jvm is a problem. > I believe this was introduced with the fix for SPARK-3884
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111044#comment-15111044 ] Dan Dutrow commented on SPARK-11045: +1 to Dibyendu's comment that "Being at spark-packages, many [people do] not even consider using it [and] use the whatever Receiver Based model which is documented with Spark." Having hit limitations in both of the Receiver and Direct APIs, it would have been nice to have been pointed to the availability of alternatives. > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to Apache Spark Project. > This Kafka consumer has been around for more than year and has matured over > the time . I see there are many adoptions of this package . I receive > positive feedbacks that this consumer gives better performance and fault > tolerant capabilities. > This is the primary intent of this JIRA to give community a better > alternative if they want to use Receiver Base model. > If this consumer make it to Spark Core, it will definitely see more adoption > and support from community and help many who still prefer the Receiver Based > model of Kafka Consumer. > I understand the Direct Stream is the consumer which can give Exact Once > semantics and uses Kafka Low Level API , which is good . But Direct Stream > has concerns around recovering checkpoint on driver code change . Application > developer need to manage their own offset which complex . Even if some one > does manages their own offset , it limits the parallelism Spark Streaming can > achieve. 
If someone wants more parallelism and want > spark.streaming.concurrentJobs more than 1 , you can no longer rely on > storing offset externally as you have no control which batch will run in > which sequence. > Furthermore , the Direct Stream has higher latency , as it fetch messages > form Kafka during RDD action . Also number of RDD partitions are limited to > topic partition . So unless your Kafka topic does not have enough partitions, > you have limited parallelism while RDD processing. > Due to above mentioned concerns , many people who does not want Exactly Once > semantics , still prefer Receiver based model. Unfortunately, when customer > fall back to KafkaUtil.CreateStream approach, which use Kafka High Level > Consumer, there are other issues around the reliability of Kafka High Level > API. Kafka High Level API is buggy and has serious issue around Consumer > Re-balance. Hence I do not think this is correct to advice people to use > KafkaUtil.CreateStream in production . > The better option presently is there is to use the Consumer from > spark-packages . It is is using Kafka Low Level Consumer API , store offset > in Zookeeper, and can recover from any failure . Below are few highlights of > this consumer .. > 1. It has a inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer , The Rate Limiting is done by modifying the size blocks > by controlling the size of messages pulled from Kafka. Whereas , in Spark the > Rate Limiting is done by controlling number of messages. The issue with > throttling by number of message is, if message size various, block size will > also vary . Let say your Kafka has messages for different sizes from 10KB to > 500 KB. Thus throttling by number of message can never give any deterministic > size of your block hence there is no guarantee that Memory Back-Pressure can > really take affect. > 3. 
This consumer is using Kafka low level API which gives better performance > than KafkaUtils.createStream based High Level API. > 4. This consumer can give end to end no data loss channel if enabled with WAL. > By accepting this low level kafka consumer from spark packages to apache > spark project , we will give community a better options for Kafka > connectivity both for Receiver less and Receiver based model.
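Point 2 above (throttling by bytes pulled rather than by message count) can be sketched as follows. This is an illustrative Python model, not the package's actual back-pressure code:

```python
def fill_block_by_bytes(message_sizes, max_bytes):
    """Throttle by total bytes pulled from Kafka: block size is bounded
    regardless of how much individual message sizes vary -- the
    behaviour the comment argues for."""
    block, total = [], 0
    for size in message_sizes:
        if total + size > max_bytes:
            break
        block.append(size)
        total += size
    return block, total

# Messages ranging from 10 KB to 500 KB: throttling by *count* (say, 100
# messages per block) could yield anywhere from ~1 MB to ~50 MB per block,
# while throttling by *bytes* caps every block at max_bytes.
sizes = [10_000, 500_000, 250_000, 10_000, 500_000]
block, total = fill_block_by_bytes(sizes, 600_000)
```

With a deterministic upper bound on block size, memory back-pressure can actually hold, which is the guarantee count-based throttling cannot give when message sizes vary.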
[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10911: Assignee: Zhuo Liu (was: Apache Spark) > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit.
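The failure mode here is general to managed runtimes: a non-daemon user thread keeps the process alive after the main work finishes. A minimal Python illustration of the same rule (Python's daemon flag plays the role that a forced `System.exit` plays in the JVM case):

```python
import threading
import time

def start_worker(daemon):
    """Start a long-running worker thread, like the user thread pool in
    the report. A non-daemon thread blocks process exit until it
    finishes; a daemon thread (or an explicit System.exit / os._exit on
    clean shutdown) lets the runtime terminate anyway."""
    t = threading.Thread(target=time.sleep, args=(3600,), daemon=daemon)
    t.start()
    return t

# With daemon=False this worker would keep the process alive for an hour
# after main() returns -- the "executor stuck for days" scenario.
worker = start_worker(daemon=True)
```

Forcing `System.exit` on clean shutdown is the JVM-side equivalent of not letting stray user threads veto process termination.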
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111206#comment-15111206 ] Dibyendu Bhattacharya commented on SPARK-11045: --- Thanks Dan for your comments. Many have told me the same, and as you can see a large number of people have voted for this consumer to be included in Spark Core. I wish everyone who voted for it and is using it would also comment with their opinion. Unfortunately the Spark committers think otherwise. Spark still documents the faulty Receiver-based model on its website, which has issues, and there are many who need an alternative to Direct Stream but are reluctant to use a spark-packages library, so they go ahead and use whatever is mentioned on the Spark website. This seems to me to misguide people and force them to use a buggy consumer despite better alternatives existing. > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to Apache Spark Project. > This Kafka consumer has been around for more than year and has matured over > the time . I see there are many adoptions of this package . I receive > positive feedbacks that this consumer gives better performance and fault > tolerant capabilities. > This is the primary intent of this JIRA to give community a better > alternative if they want to use Receiver Base model. > If this consumer make it to Spark Core, it will definitely see more adoption > and support from community and help many who still prefer the Receiver Based > model of Kafka Consumer.
> I understand the Direct Stream is the consumer which can give Exact Once > semantics and uses Kafka Low Level API , which is good . But Direct Stream > has concerns around recovering checkpoint on driver code change . Application > developer need to manage their own offset which complex . Even if some one > does manages their own offset , it limits the parallelism Spark Streaming can > achieve. If someone wants more parallelism and want > spark.streaming.concurrentJobs more than 1 , you can no longer rely on > storing offset externally as you have no control which batch will run in > which sequence. > Furthermore , the Direct Stream has higher latency , as it fetch messages > form Kafka during RDD action . Also number of RDD partitions are limited to > topic partition . So unless your Kafka topic does not have enough partitions, > you have limited parallelism while RDD processing. > Due to above mentioned concerns , many people who does not want Exactly Once > semantics , still prefer Receiver based model. Unfortunately, when customer > fall back to KafkaUtil.CreateStream approach, which use Kafka High Level > Consumer, there are other issues around the reliability of Kafka High Level > API. Kafka High Level API is buggy and has serious issue around Consumer > Re-balance. Hence I do not think this is correct to advice people to use > KafkaUtil.CreateStream in production . > The better option presently is there is to use the Consumer from > spark-packages . It is is using Kafka Low Level Consumer API , store offset > in Zookeeper, and can recover from any failure . Below are few highlights of > this consumer .. > 1. It has a inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer , The Rate Limiting is done by modifying the size blocks > by controlling the size of messages pulled from Kafka. Whereas , in Spark the > Rate Limiting is done by controlling number of messages. 
The issue with > throttling by number of message is, if message size various, block size will > also vary . Let say your Kafka has messages for different sizes from 10KB to > 500 KB. Thus throttling by number of message can never give any deterministic > size of your block hence there is no guarantee that Memory Back-Pressure can > really take affect. > 3. This consumer is using Kafka low level API which gives better performance > than KafkaUtils.createStream based High Level API. > 4. This consumer can give end to end no data loss channel if enabled with WAL. > By accepting this low level kafka consumer from spark packages to apache > spark project , we will give community a better options for Kafka > connectivity both for Receiver less and Receiver based model.
[jira] [Updated] (SPARK-12797) Aggregation without grouping keys
[ https://issues.apache.org/jira/browse/SPARK-12797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12797: -- Assignee: Davies Liu > Aggregation without grouping keys > - > > Key: SPARK-12797 > URL: https://issues.apache.org/jira/browse/SPARK-12797 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > >
[jira] [Updated] (SPARK-12953) RDDRelation should set a write mode to avoid the error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12953: -- Priority: Minor (was: Major) Fix Version/s: (was: 1.6.1) Issue Type: Improvement (was: Wish) [~shijinkui] don't set Fix version > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Priority: Minor > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
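The improvement being asked for is simply to set a save mode before writing, e.g. `df.write.mode("overwrite").parquet(path)`, so re-running the example does not hit the "already exists" error. The mode semantics can be sketched in plain Python (a single-file stand-in for DataFrameWriter; `save_text` is a hypothetical helper):

```python
import os
import tempfile

def save_text(path, data, mode="error"):
    """Mimic DataFrameWriter.mode() semantics for a single file:
    "error" raises if the path exists (the default the example hit),
    "overwrite" replaces it, "ignore" silently skips the write."""
    if os.path.exists(path):
        if mode == "error":
            raise FileExistsError(f"path {path} already exists")
        if mode == "ignore":
            return
    with open(path, "w") as f:
        f.write(data)

target = os.path.join(tempfile.mkdtemp(), "pair.parquet")
save_text(target, "first")
save_text(target, "second", mode="overwrite")   # safe on re-run
```

With the default "error" mode, the second call would raise exactly the way `InsertIntoHadoopFsRelation` does in the stack trace above; making the example pass an explicit mode is what the ticket proposes.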
[jira] [Updated] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception
[ https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12945: -- Component/s: Web UI > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > -- > > Key: SPARK-12945 > URL: https://issues.apache.org/jira/browse/SPARK-12945 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 > Environment: Linux, yarn-client >Reporter: Tristan >Priority: Minor > > Seeing this a lot; not sure if it is a problem or spurious error (I recall > this was an ignorable issue in previous version). The UI seems to be working > fine: > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at > org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at > org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[jira] [Resolved] (SPARK-6034) DESCRIBE EXTENDED viewname is not supported for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6034. -- Resolution: Won't Fix > DESCRIBE EXTENDED viewname is not supported for HiveContext > --- > > Key: SPARK-6034 > URL: https://issues.apache.org/jira/browse/SPARK-6034 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > In HiveContext, describe extended [table | view | cache | datasource table] > is not well supported -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9282) Filter on Spark DataFrame with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9282. -- Resolution: Not A Problem > Filter on Spark DataFrame with multiple columns > --- > > Key: SPARK-9282 > URL: https://issues.apache.org/jira/browse/SPARK-9282 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, SQL >Affects Versions: 1.3.0 > Environment: CDH 5.0 on CentOS6 >Reporter: Sandeep Pal > > Filter on dataframe does not work if we have more than one column inside the filter. Nonetheless, it works on an RDD. > Following is the example:
> df1.show()
> age coolid depid empname
> 23   7      1     sandeep
> 21   8      2     john
> 24   9      1     cena
> 45  12      3     bob
> 20   7      4     tanay
> 12   8      5     gaurav
> df1.filter(df1.age > 21 and df1.age < 45).show(10)
> 23   7      1     sandeep
> 21   8      2     john <-
> 24   9      1     cena
> 20   7      4     tanay <-
> 12   8      5     gaurav <-- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
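For context on why the filter above misbehaves: Python's `and` cannot be overloaded, so `df1.age > 21 and df1.age < 45` silently reduces to just one of the two conditions instead of building a combined predicate; the supported spelling is `(df1.age > 21) & (df1.age < 45)`. The toy class below is not Spark's actual `Column` implementation, only a minimal sketch of the mechanics (recent PySpark versions additionally raise an error when a `Column` is used in a boolean context):

```python
# Toy stand-in for a Spark SQL Column; not Spark's real class, just enough
# to show why Python's `and` breaks multi-column filters.
class Column(object):
    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        return Column("(%s > %s)" % (self.expr, other))

    def __lt__(self, other):
        return Column("(%s < %s)" % (self.expr, other))

    def __and__(self, other):  # the `&` operator CAN be overloaded
        return Column("(%s AND %s)" % (self.expr, other.expr))


age = Column("age")

# `and` cannot be overloaded: bool(left) is truthy by default, so the whole
# expression silently evaluates to just the right-hand operand.
broken = (age > 21) and (age < 45)
print(broken.expr)  # only the second condition survives

# `&` builds the combined predicate, which is why Spark requires it here.
correct = (age > 21) & (age < 45)
print(correct.expr)
```

This matches the reported output: only `age < 45` was applied, so the rows marked with arrows wrongly appear.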
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111243#comment-15111243 ] Saisai Shao commented on SPARK-11045: - Hi [~dibbhatt], I'm afraid I cannot agree with your comment about the current receiver-based Kafka connector. Currently Spark Streaming uses the high-level API to fetch data; beyond calling the Kafka-provided API, Spark's own part is quite simple. If you're saying this approach is buggy and faulty, I think the bugs and faults are in the Kafka high-level API, not in the way Spark Streaming uses it. What you're doing is fixing things that should be fixed by Kafka, not Spark. Yes, we could improve the current receiver-based way, but building up from the ground to fix issues that will also be improved in Kafka will make it diverge from Kafka's way. Just my two cents :). > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of contributing the Receiver-based Low Level Kafka Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the Apache Spark project. > This Kafka consumer has been around for more than a year and has matured over time. I see there are many adoptions of this package, and I receive positive feedback that this consumer gives better performance and fault-tolerance capabilities. > The primary intent of this JIRA is to give the community a better alternative if they want to use the Receiver-based model. > If this consumer makes it into Spark Core, it will definitely see more adoption and support from the community, and help the many who still prefer the Receiver-based model of Kafka consumer. 
> I understand the Direct Stream is the consumer that can give Exactly-Once semantics and uses the Kafka low-level API, which is good. But the Direct Stream has concerns around recovering checkpoints on driver code change. Application developers need to manage their own offsets, which is complex. Even if someone does manage their own offsets, it limits the parallelism Spark Streaming can achieve: if someone wants more parallelism and sets spark.streaming.concurrentJobs to more than 1, you can no longer rely on storing offsets externally, as you have no control over which batch will run in which sequence. > Furthermore, the Direct Stream has higher latency, as it fetches messages from Kafka during the RDD action. Also, the number of RDD partitions is limited to the number of topic partitions, so unless your Kafka topic has enough partitions, you have limited parallelism during RDD processing. > Due to the above concerns, many people who do not need Exactly-Once semantics still prefer the Receiver-based model. Unfortunately, when customers fall back to the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, there are other issues around the reliability of the Kafka high-level API. The Kafka high-level API is buggy and has a serious issue around consumer re-balance, hence I do not think it is correct to advise people to use KafkaUtils.createStream in production. > The better option presently is to use the consumer from spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can recover from any failure. Below are a few highlights of this consumer: > 1. It has an inbuilt PID controller for dynamic rate limiting. > 2. In this consumer, rate limiting is done by controlling the block size, i.e. the number of bytes pulled from Kafka, whereas in Spark rate limiting is done by controlling the number of messages. The issue with throttling by number of messages is that if message sizes vary, block sizes will also vary. Say your Kafka topic has messages of different sizes, from 10 KB to 500 KB: throttling by number of messages can never give a deterministic block size, hence there is no guarantee that memory back-pressure can really take effect. > 3. This consumer uses the Kafka low-level API, which gives better performance than the KafkaUtils.createStream-based high-level API. > 4. This consumer can provide an end-to-end no-data-loss channel when enabled with the WAL. > By accepting this low-level Kafka consumer from spark-packages into the Apache Spark project, we will give the community better options for Kafka connectivity, for both the receiver-less and receiver-based models. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
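To make point 2 above concrete, here is a small illustrative sketch (plain Python with made-up message sizes; it is not code from either consumer) contrasting throttling by message count with throttling by bytes:

```python
import random

random.seed(0)
# Hypothetical payload sizes in the 10 KB - 500 KB range from the example above.
sizes = [random.randint(10000, 500000) for _ in range(1000)]


def blocks_by_count(msgs, max_msgs):
    """Throttle by number of messages: block byte-size drifts with payloads."""
    return [sum(msgs[i:i + max_msgs]) for i in range(0, len(msgs), max_msgs)]


def blocks_by_bytes(msgs, max_bytes):
    """Throttle by bytes pulled: every block stays under a deterministic cap."""
    blocks, current = [], 0
    for m in msgs:
        if current and current + m > max_bytes:
            blocks.append(current)  # close the block before it overflows
            current = 0
        current += m
    if current:
        blocks.append(current)
    return blocks


by_count = blocks_by_count(sizes, 50)        # 50 messages per block
by_bytes = blocks_by_bytes(sizes, 5000000)   # 5 MB cap per block
print("count-throttled block sizes spread over", max(by_count) - min(by_count), "bytes")
print("byte-throttled blocks never exceed", max(by_bytes), "bytes")
```

The byte-based variant gives memory back-pressure a hard bound to work with, while the count-based variant's block sizes swing with the payload mix.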
[jira] [Assigned] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10498: Assignee: (was: Apache Spark) > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111527#comment-15111527 ] Apache Spark commented on SPARK-10498: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10871 > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10498: Assignee: Apache Spark > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12960) Some examples are missing support for python2
Mark Grover created SPARK-12960: --- Summary: Some examples are missing support for python2 Key: SPARK-12960 URL: https://issues.apache.org/jira/browse/SPARK-12960 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.0 Reporter: Mark Grover Priority: Minor Without importing print_function from __future__, lines later on like {code} print("Usage: direct_kafka_wordcount.py ", file=sys.stderr) {code} fail when using Python 2.x. Adding the import fixes that problem and doesn't break anything on Python 3 either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
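For reference, the fix amounts to a single `__future__` import at the top of each affected example (the usage string below is illustrative; the real scripts' argument lists are not shown in this report):

```python
# Must appear before any other statement (except the module docstring).
from __future__ import print_function

import sys

# Under Python 2 this line is a SyntaxError without the import above, because
# `print` is a statement there and accepts no `file=` keyword; with the import,
# `print` is a function on both Python 2 and Python 3.
print("Usage: direct_kafka_wordcount.py ...", file=sys.stderr)
```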
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111398#comment-15111398 ] Cody Koeninger commented on SPARK-11045: There's already work being done on 0.9 https://issues.apache.org/jira/browse/SPARK-12177 > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111598#comment-15111598 ] Alex Bozarth commented on SPARK-12946: -- Can you give more details on this? Such as what commands did you run in the shell, what are your spark/yarn/hive configs, and how did you access the Web UI? > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12957) Derive and propagate data constrains in logical plan
[ https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111597#comment-15111597 ] Xiao Li commented on SPARK-12957: - I have two related PRs that require a general null filtering function: https://github.com/apache/spark/pull/10567 and https://github.com/apache/spark/pull/10566. Could you share your opinions on how to make it more general? > Derive and propagate data constrains in logical plan > - > > Key: SPARK-12957 > URL: https://issues.apache.org/jira/browse/SPARK-12957 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yin Huai > > Based on the semantics of a query plan, we can derive data constraints (e.g. if a filter defines {{a > 10}}, we know that the output data of this filter satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a framework to derive and propagate constraints in the logical plan, which can help us build more advanced optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
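As a sketch of the idea behind this issue (plain Python, not Catalyst code; the helper name is made up): given a filter's predicates, such a framework could return both the predicates themselves and the null-rejection facts they imply, since a comparison can only hold when its attribute is non-null.

```python
def derive_constraints(predicates):
    """Given textual filter predicates such as 'a > 10', return them plus the
    implied null-rejection constraints (a comparison can only evaluate to true
    when its attribute is non-null)."""
    comparison_ops = (">=", "<=", ">", "<", "=")  # two-char operators first
    constraints = set(predicates)
    for pred in predicates:
        for op in comparison_ops:
            if op in pred:
                attribute = pred.split(op)[0].strip()
                constraints.add("%s IS NOT NULL" % attribute)
                break
    return constraints


# A filter of `a > 10` yields both the predicate and `a IS NOT NULL`,
# which downstream optimizations (e.g. null filtering) can then reuse.
print(sorted(derive_constraints(["a > 10"])))
```

A real implementation would of course work on expression trees rather than strings and would also propagate constraints through joins and projections.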
[jira] [Commented] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111645#comment-15111645 ] Alex Bozarth commented on SPARK-12946: -- These may be the same problem > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-12946: - Comment: was deleted (was: These may be the same problem) > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12859: Assignee: (was: Apache Spark) > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12859: Assignee: Apache Spark > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111660#comment-15111660 ] Apache Spark commented on SPARK-12859: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/10873 > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12959) Silent switch to normal table writing when writing bucketed data with bucketing disabled
Xiao Li created SPARK-12959: --- Summary: Silent switch to normal table writing when writing bucketed data with bucketing disabled Key: SPARK-12959 URL: https://issues.apache.org/jira/browse/SPARK-12959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li When users turn off bucketing in SQLConf, we should issue a message telling users that these operations will be converted to a normal (non-bucketed) write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
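A sketch of the requested behavior (hypothetical function and logger names, not Spark's actual writer code): check the session flag and log a warning before falling back to the normal write path.

```python
import logging

logging.basicConfig()


def write_table(table_is_bucketed, bucketing_enabled,
                log=logging.getLogger("datasources")):
    """Pick the write path; warn instead of silently downgrading."""
    if table_is_bucketed and not bucketing_enabled:
        # The suggested fix: surface the silent conversion to the user.
        log.warning("bucketing is disabled in SQLConf; "
                    "writing the data as a normal (non-bucketed) table.")
        return "normal"
    return "bucketed" if table_is_bucketed else "normal"


print(write_table(table_is_bucketed=True, bucketing_enabled=False))
```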
[jira] [Commented] (SPARK-9721) TreeTests.checkEqual should compare predictions on data
[ https://issues.apache.org/jira/browse/SPARK-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111579#comment-15111579 ] Seth Hendrickson commented on SPARK-9721: - I assume there is some motivation behind this JIRA, but given the current tree structure in spark.ml, I cannot think of a case where all splits and node predictions could be equal for two trees, but they do not predict exactly the same values for a given dataset. Is there a specific case this change would protect against? > TreeTests.checkEqual should compare predictions on data > --- > > Key: SPARK-9721 > URL: https://issues.apache.org/jira/browse/SPARK-9721 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley > > In spark.ml tree and ensemble unit tests: > Modify TreeTests.checkEqual in unit tests to compare predictions on the > training data, rather than only comparing the trees themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org