[jira] [Updated] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12741:
-
Description:
Hi, I'm updating my report. I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and two methods: one collects data and the other counts it.
Method doQuery looks like:
{code}
dataFrame.collect()
{code}
Method doQueryCount looks like:
{code}
dataFrame.count()
{code}
I have a few scenarios, with these results:
1) No data exists in my NoSQL database: count 0 and collect 0.
2) 3 rows exist: count 0 and collect 3.
3) 5 rows exist: count 2 and collect 5.
I tried changing the count code to the code below, but got the same results as above.
{code}
dataFrame.sql("select count(*) from tbl").count/collect[0]
{code}
Thanks, Sasi

was: (the same description, but using subscribersDataFrame instead of dataFrame)

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
malouke created SPARK-12954:
---
Summary: pyspark API 1.3.0 how we can patitionning by columns
Key: SPARK-12954
URL: https://issues.apache.org/jira/browse/SPARK-12954
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.3.0
Environment: spark 1.3.0, Cloudera Manager, Linux platform, pyspark
Reporter: malouke
Priority: Blocker

Hi, before posting this question I tried a lot of things, but I found no solution. I have 9 tables and I join them in two ways:
1) First test with df.join(df2, df.id == df2.id2, 'left_outer')
2) sqlContext.sql("select * from t1 left join t2 on id_t1 = id_t2")
After that I want to partition the result of the join by date:
- In PySpark 1.5.2 I tried partitionBy; if the table does not come from a join of more than two tables, everything is OK, but when I join more than three tables I get no result after several hours.
- In PySpark 1.3.0 I could not find an API function that lets me partition by date columns.
Q: Can someone help me resolve this problem? Thank you in advance.
[jira] [Resolved] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12954.
---
Resolution: Invalid
Target Version/s: (was: 1.3.0)
[~Malouke] a lot is wrong with this. Please don't open a JIRA until you read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Don't set Blocker or Target Version. Ask questions on u...@spark.apache.org
[jira] [Commented] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110628#comment-15110628 ] Daniel Darabos commented on SPARK-2309:
---
https://github.com/apache/spark/blob/v1.6.0/docs/ml-classification-regression.md#logistic-regression still says:
> The current implementation of logistic regression in spark.ml only supports binary classes. Support for multiclass regression will be added in the future.
That can be removed now, right?

> Generalize the binary logistic regression into multinomial logistic regression
> --
>
> Key: SPARK-2309
> URL: https://issues.apache.org/jira/browse/SPARK-2309
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: DB Tsai
> Assignee: DB Tsai
> Priority: Critical
> Fix For: 1.3.0
>
> Currently, there is no multi-class classifier in mllib. Logistic regression can be extended to a multinomial one straightforwardly.
> The following formula will be implemented.
> http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110523#comment-15110523 ] Sasi commented on SPARK-12741:
--
I changed the way I use the DataFrame since my last ticket. Now I have a DataFrame without any cache or persist operation, so each time I add/remove a row I see it in the DataFrame. The only problem I'm having right now is the count operation. I'll try to create a new DataFrame for each count call; maybe that will resolve the problem. By the way, I see the count result change, e.g. from 3 to 5, but the count is still not equal to the real DataFrame size. Thanks again! Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110553#comment-15110553 ] Sasi commented on SPARK-12741:
--
Creating a new DataFrame didn't resolve the issue. I still think it's a bug. Thanks, Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110584#comment-15110584 ] Sasi commented on SPARK-12741:
--
I checked my DB, which is Aerospike, and I got the same results as my collect. I'm creating my DataFrame with Aerospark, a connector written by Sasha: https://github.com/sasha-polev/aerospark/
I'm using the DataFrame actions as described in the SQL programming guide: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
I know there are two ways to run actions on a DataFrame:
1) The SQL way.
{code}
dataFrame.sqlContext().sql("select count(*) from tbl").collect()[0]
{code}
2) The DataFrame way.
{code}
dataFrame.where("...").count()
{code}
I'm using the DataFrame way, which is simpler to understand and to read as Java code.
[jira] [Updated] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sasi updated SPARK-12741:
-
Description:
Hi, I'm updating my report. I'm working with Spark 1.5.2 (previously 1.5.0). I have a DataFrame and two methods: one collects data and the other counts it.
Method doQuery looks like:
{code}
subscribersDataFrame.collect()
{code}
Method doQueryCount looks like:
{code}
subscribersDataFrame.count()
{code}
I have a few scenarios, with these results:
1) No data exists in my NoSQL database: count 0 and collect 0.
2) 3 rows exist: count 0 and collect 3.
3) 5 rows exist: count 2 and collect 5.
I tried changing the count code to the code below, but got the same results as above.
{code}
subscribersDataFrame.sql("select count(*) from tbl").count/collect[0]
{code}
Thanks, Sasi

was: Hi, I noticed that the DataFrame count method always returns the wrong size. Assume I have 11 records. When running dataframe.count() I get 9. Also, if I run dataframe.collectAsList() I get 9 records instead of 11. But if I run dataframe.collect() I get 11. Thanks, Sasi
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110512#comment-15110512 ] dileep commented on SPARK-12843:
--
I will look into this issue.

> Spark should avoid scanning all partitions when limit is set
> --
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when the data is available at the beginning of the scan.
> This behaviour should be avoided; the scan should stop when enough data has been collected.
> Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110596#comment-15110596 ] malouke commented on SPARK-12954:
--
OK, sorry.
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110563#comment-15110563 ] Sasi commented on SPARK-12741:
--
If I run the following code, the result is the same as the collect:
{code}
dataFrame.where("...").count()
{code}
[jira] [Commented] (SPARK-12954) pyspark API 1.3.0 how we can patitionning by columns
[ https://issues.apache.org/jira/browse/SPARK-12954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110603#comment-15110603 ] malouke commented on SPARK-12954:
--
Hi Sean, where can I ask questions?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110572#comment-15110572 ] Sean Owen commented on SPARK-12741:
---
I can't reproduce this. I always get the same count and collect result on an example data set. Above, I think you mean sqlContext, not dataFrame, right? This makes me wonder what you're really executing. Also, couldn't it be an issue with your DB?
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110503#comment-15110503 ] Sasi commented on SPARK-12741:
--
I updated the report; can you verify it again? Thanks! Sasi
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110613#comment-15110613 ] Sasi commented on SPARK-12741:
--
Additional update: if I use the following code, then I get the same length as the collect.
{code}
subscribersDataFrame.rdd().count();
{code}
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110511#comment-15110511 ] Sean Owen commented on SPARK-12741:
---
I recall from other JIRAs that you're not updating the DataFrame / RDD but updating the table? That's not going to cause the result to change -- or it may; it's undefined.
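The snapshot-versus-recomputation semantics described here can be illustrated without Spark at all. A plain-Python analogy (purely illustrative, no Spark API involved): a materialized copy of the data is frozen at the moment it was taken, while a query that is re-evaluated against the source sees later mutations, so mixing the two styles produces exactly the kind of "changing" counts reported in this thread.

```python
# A cached/collected result is a snapshot; re-running the query against the
# source sees later mutations. Mixing the two styles explains inconsistent counts.
table = [1, 2, 3]               # the external table
snapshot = list(table)          # like a cached collect(): a frozen copy
count_now = lambda: len(table)  # like re-running count() against the source

table.append(4)                 # the table changes after the snapshot was taken

assert len(snapshot) == 3       # the snapshot does not see the new row
assert count_now() == 4         # a fresh evaluation does
```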
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12247:
---
Assignee: Benjamin Fradet

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
> Issue Type: Sub-task
> Components: Documentation, MLlib
> Affects Versions: 1.5.2
> Reporter: Timothy Hunter
> Assignee: Benjamin Fradet
>
> We need to add a section in the documentation about collaborative filtering in the dataframe API:
> - copy explanations about collaborative filtering and ALS from spark.mllib
> - provide an example with spark.ml's ALS
[jira] [Assigned] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12953: Assignee: Apache Spark > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui >Assignee: Apache Spark > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at 
net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110301#comment-15110301 ] Apache Spark commented on SPARK-12953:
--
User 'shijinkui' has created a pull request for this issue: https://github.com/apache/spark/pull/10864
[jira] [Created] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
shijinkui created SPARK-12953: - Summary: RDDRelation write set mode will be better to avoid error "pair.parquet already exists" Key: SPARK-12953 URL: https://issues.apache.org/jira/browse/SPARK-12953 Project: Spark Issue Type: Wish Components: SQL Reporter: shijinkui Fix For: 1.6.1 An error occurs if the write mode is not set when executing the test case `RDDRelation.main()` Exception in thread "main" org.apache.spark.sql.AnalysisException: path file:/Users/sjk/pair.parquet already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
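The fix being requested is to pass an explicit save mode to the writer (in Spark's Scala API, `df.write.mode(SaveMode.Overwrite).parquet(path)`), so the write replaces existing output instead of throwing "path already exists". The semantics can be illustrated without Spark; the sketch below uses plain Java file I/O with a hypothetical `writeFile` helper and a `Mode` enum modelled on (but not taken from) Spark's `SaveMode` — an illustration of the behavior, not Spark's implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveModeSketch {
    // Modelled on Spark's SaveMode options (assumption: behavior, not Spark code).
    enum Mode { ERROR_IF_EXISTS, OVERWRITE, IGNORE }

    // Hypothetical helper: ERROR_IF_EXISTS throws when the target already exists
    // (the failure seen in the report), OVERWRITE replaces it, IGNORE skips it.
    static void writeFile(Path target, String data, Mode mode) throws IOException {
        if (Files.exists(target)) {
            switch (mode) {
                case ERROR_IF_EXISTS:
                    throw new IOException("path " + target + " already exists");
                case IGNORE:
                    return; // leave the existing output untouched
                case OVERWRITE:
                    break;  // fall through and replace it
            }
        }
        Files.write(target, data.getBytes());
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempDirectory("savemode").resolve("pair.parquet");
        writeFile(p, "first", Mode.ERROR_IF_EXISTS);  // fresh path: write succeeds
        writeFile(p, "second", Mode.OVERWRITE);       // replaces the file
        writeFile(p, "third", Mode.IGNORE);           // no-op: content stays "second"
        System.out.println(new String(Files.readAllBytes(p)));
        try {
            writeFile(p, "fourth", Mode.ERROR_IF_EXISTS);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // "... already exists", as in the report
        }
    }
}
```

Running `RDDRelation.main()` twice fails for the same reason as the final `ERROR_IF_EXISTS` call here: an error-on-exists default plus pre-existing output.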
[jira] [Commented] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110354#comment-15110354 ] Sasi commented on SPARK-12906: -- Looks like it was fixed in 1.5.2. Thanks! > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't see any bug report or documentation about it. > Thanks, > Sasi
[jira] [Resolved] (SPARK-12906) LongSQLMetricValue cause memory leak on Spark 1.5.1
[ https://issues.apache.org/jira/browse/SPARK-12906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12906. --- Resolution: Duplicate > LongSQLMetricValue cause memory leak on Spark 1.5.1 > --- > > Key: SPARK-12906 > URL: https://issues.apache.org/jira/browse/SPARK-12906 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Sasi > Attachments: dump1.PNG, screenshot-1.png > > > Hi, > I upgraded my Spark from 1.5.0 to 1.5.1 after seeing that > scala.util.parsing.combinator.Parser$$anon$3 was causing a memory leak. > Now, after taking another heap dump, I noticed, after 2 hours, that > LongSQLMetricValue causes a memory leak. > I didn't see any bug report or documentation about it. > Thanks, > Sasi
[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general
[ https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath updated SPARK-12247: --- Affects Version/s: (was: 1.5.2) 2.0.0 > Documentation for spark.ml's ALS and collaborative filtering in general > --- > > Key: SPARK-12247 > URL: https://issues.apache.org/jira/browse/SPARK-12247 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib >Affects Versions: 2.0.0 >Reporter: Timothy Hunter >Assignee: Benjamin Fradet > > We need to add a section in the documentation about collaborative filtering > in the dataframe API: > - copy explanations about collaborative filtering and ALS from spark.mllib > - provide an example with spark.ml's ALS
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110299#comment-15110299 ] Felix Cheung commented on SPARK-6817: - Thanks for putting the doc together, [~sunrui]. In this design, how does one control the partitioning? For instance, suppose one would like to group a census data DataFrame by a certain column, say MetropolitanArea, and then pass it to R's kmeans to cluster residents within close-by geographical areas. In order for the R UDFs to be effective, in this and some other cases, one would need to make sure the data is partitioned appropriately, and that mapPartition would produce a local R data.frame (assuming it fits into memory) that has all the relevant data in it? > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark.
[jira] [Updated] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shijinkui updated SPARK-12953: -- Component/s: (was: SQL) Examples > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at 
net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Assigned] (SPARK-12953) RDDRelation write set mode will be better to avoid error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12953: Assignee: (was: Apache Spark) > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Wish > Components: Examples >Reporter: shijinkui > Fix For: 1.6.1 > > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at 
net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110359#comment-15110359 ] Sun Rui commented on SPARK-6817: For dapply(), users can call repartition() to set an appropriate number of partitions before calling dapply(). For gapply(), the SQL conf "spark.sql.shuffle.partitions" could be used to tune the number of partitions after the shuffle. I am also hoping SPARK-9850 (Adaptive execution in Spark) could help. > DataFrame UDFs in R > --- > > Key: SPARK-6817 > URL: https://issues.apache.org/jira/browse/SPARK-6817 > Project: Spark > Issue Type: New Feature > Components: SparkR, SQL >Reporter: Shivaram Venkataraman > > This depends on some internal interface of Spark SQL, should be done after > merging into Spark.
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep edited comment on SPARK-12843 at 1/21/16 2:57 PM: - It's a caching issue: while scanning the table we need to cache the records, so from the next scan onwards it won't take much time. was (Author: dileep): Its a caching issue, while scanning the table need to cache the records, so from mext onwards it wont take much time > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110786#comment-15110786 ] Sean Owen commented on SPARK-12932: --- OK, do you want to make a PR for that? > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code}
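The bean requirement described above — `Encoders.bean` needs a public no-arg constructor plus getter/setter pairs — can be checked without Spark using the standard `java.beans.Introspector`. A minimal sketch (the `isBeanCompliant` helper and this `Person` variant are illustrative, not Spark APIs):

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.Serializable;

public class BeanCheck {
    // A Person with BOTH a default constructor AND setters -- the shape
    // Encoders.bean(Person.class) expects, per the issue description.
    public static class Person implements Serializable {
        private String first;
        private int age;
        public Person() {}
        public String getFirst() { return first; }
        public void setFirst(String first) { this.first = first; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Hypothetical helper: true if the class has a public no-arg constructor
    // and every introspected property has both a getter and a setter.
    static boolean isBeanCompliant(Class<?> cls) {
        try {
            cls.getConstructor(); // throws if there is no public no-arg constructor
            for (PropertyDescriptor pd : Introspector.getBeanInfo(cls, Object.class)
                                                     .getPropertyDescriptors()) {
                if (pd.getReadMethod() == null || pd.getWriteMethod() == null) {
                    return false; // getter or setter missing for this property
                }
            }
            return true;
        } catch (NoSuchMethodException | IntrospectionException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isBeanCompliant(Person.class)); // prints "true"
        System.out.println(isBeanCompliant(String.class)); // prints "false"
    }
}
```

A check like this is roughly what a more useful error message could report: which property is missing its getter or setter, rather than a bare "no encoder found".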
[jira] [Updated] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12932: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code}
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110688#comment-15110688 ] Emlyn Corrin commented on SPARK-9740: - How do you use FIRST/LAST from the Java API with ignoreNulls now? I can't find a way to specify ignoreNulls=true. > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive.
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 2:40 PM: Wait, is this what you mean? {{select count(*) ...}} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different; they're different things. That seems to be your problem. was (Author: srowen): Wait, is this what you mean? "select count(*) ..." returns 1 row, which contains a number, which is the number of rows matching the predicate. count() returns 1 because there is 1 row in the result set. collect() collects (an Array of) that number of rows. Those are different; they're different things. That seems to be your problem. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. 
> {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi
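Sean Owen's point above is that `count()` counts the rows of the result set it is called on, while the single row returned by `select count(*)` carries the count as a value. The distinction can be sketched without Spark, using plain Java lists as stand-ins for DataFrames (names are illustrative only):

```java
import java.util.Arrays;
import java.util.List;

public class CountVsCollect {
    public static void main(String[] args) {
        // Five "rows" standing in for the DataFrame's contents.
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");

        // dataFrame.count() and dataFrame.collect().length both describe the data:
        long count = rows.size();             // analogous to dataFrame.count()
        Object[] collected = rows.toArray();  // analogous to dataFrame.collect()
        System.out.println(count);            // prints 5
        System.out.println(collected.length); // prints 5

        // "select count(*)" yields ONE row whose single value is the count:
        List<Long> countQueryResult = Arrays.asList((long) rows.size());
        System.out.println(countQueryResult.size());  // 1 -- like .count() on that result
        System.out.println(countQueryResult.get(0));  // 5 -- like .collect()[0]
    }
}
```

So calling `.count()` on a `select count(*)` result always reports 1, regardless of how many rows the underlying table has; the reported count/collect mismatch on the raw DataFrame is a separate question about the data source.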
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep commented on SPARK-12843: Please see the code below. We need to make use of the DataFrame's caching mechanism. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data. > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep edited comment on SPARK-12843 at 1/21/16 3:08 PM: - When I verified with 2 lakh (200,000) records, I was able to measure the millisecond difference through the program, which clearly shows it is not scanning all the records. But you can improve query performance by using the DataFrame object's cache method; I can see a significant improvement in query performance. Please see the code snippet below. We need to make use of the DataFrame's caching mechanism. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data. was (Author: dileep): Please see the below Code snippet. We need to make use of caching mechanism of the data frame. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); Which is making significant improvement in the select query. So for subsequent select query it wont select the entire data > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Assigned] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12760: Assignee: Apache Spark > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Assignee: Apache Spark >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say?
[jira] [Assigned] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12760: Assignee: (was: Apache Spark) > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say?
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep commented on SPARK-12843: It's a caching issue: while scanning the table we need to cache the records, so from the next scan onwards it won't take much time. > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided and the scan should stop when enough data is > collected. > Is it related to: [SPARK-9850] ?
[jira] [Commented] (SPARK-10262) Add @Since annotation to ml.attribute
[ https://issues.apache.org/jira/browse/SPARK-10262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110743#comment-15110743 ] Tommy Yu commented on SPARK-10262: -- Hi Xiangrui Meng, I took a look at all the classes under ml.attribute; they are all developer APIs. The only non-developer API is "AttributeFactory", but it is private to this package as well. I think this task should be closed without any PR needed. Can you please take a look? Regards, Yu Wenpei. > Add @Since annotation to ml.attribute > - > > Key: SPARK-10262 > URL: https://issues.apache.org/jira/browse/SPARK-10262 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Priority: Minor > Labels: starter >
[jira] [Commented] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110814#comment-15110814 ] Apache Spark commented on SPARK-12760: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/10866 > inaccurate description for difference between local vs cluster mode in > closure handling > --- > > Key: SPARK-12760 > URL: https://issues.apache.org/jira/browse/SPARK-12760 > Project: Spark > Issue Type: Bug > Components: Documentation >Reporter: Mortada Mehyar >Priority: Minor > > In the spark documentation there's an example for illustrating how `local` > and `cluster` mode can differ > http://spark.apache.org/docs/latest/programming-guide.html#example > " In local mode with a single JVM, the above code will sum the values within > the RDD and store it in counter. This is because both the RDD and the > variable counter are in the same memory space on the driver node." > However the above doesn't seem to be true. Even in `local` mode it seems like > the counter value should still be 0, because the variable will be summed up > in the executor memory space, but the final value in the driver memory space > is still 0. I tested this snippet and verified that in `local` mode the value > is indeed still 0. > Is the doc wrong or perhaps I'm missing something the doc is trying to say? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
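The separation the ticket debates can be reproduced without Spark: when the summing runs in another OS process (as it does on an executor), the driver-side variable is never updated. A minimal Python sketch of that memory-space split follows; it is an analogy for the closure-capture behaviour, not Spark code, and the variable names are illustrative.

```python
from multiprocessing import Process

counter = 0

def add_all(values):
    # Runs in a child process: this mutates the CHILD's copy of
    # `counter`, just as an executor mutates its own copy of a
    # variable captured in a closure.
    global counter
    for v in values:
        counter += v

if __name__ == "__main__":
    p = Process(target=add_all, args=([1, 2, 3, 4],))
    p.start()
    p.join()
    # The parent's counter is still 0: the sum never came back.
    print(counter)  # 0
```

This mirrors the cluster-mode case; the open question in the ticket is whether `local` mode, running in one JVM, really behaves differently, or whether the closure is serialized and copied there too.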
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110701#comment-15110701 ] Sasi commented on SPARK-12741: -- That's not what I meant. I just gave an example for each case, the SQL way and the DataFrame way. I know that count() on select count(*) returns 1 row; that's why I wrote collect()[0], which gives back the value. As I said before: dataFrame.count() and dataFrame.where("...").count() return a wrong size compared with dataFrame.where("...").collect().length or dataFrame.collect().length. I'm pretty sure that count() doesn't work as expected. Sasi > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762 ] Sean Owen commented on SPARK-12741: --- OK, that's what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. > {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
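Sean's point is that the two sides can only be compared over the same predicate. The invariant being disputed can be sketched with plain Python lists standing in for a DataFrame (illustrative only; `rows` and `predicate` are made-up stand-ins, not Spark API):

```python
rows = [{"age": a} for a in (15, 22, 31, 17, 40)]
predicate = lambda r: r["age"] >= 18

# The DataFrame contract under test: count() over a filter must equal
# the length of collect() over the SAME filter.
filtered = [r for r in rows if predicate(r)]      # "collect" side
count = sum(1 for r in rows if predicate(r))      # "count" side

print(count, len(filtered))  # 3 3
```

If the reported mismatch compares `where("...").collect().length` against an unfiltered `count()`, the two sides differ by construction, which is the clarification Sean is asking for.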
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110762#comment-15110762 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 3:26 PM: OK, that's different from what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. was (Author: srowen): OK, that's what you wrote at the outset though. Then I can't reproduce it. I always get the correct count both ways. {{where("...")}} isn't what you're really executing; what are you writing? are you sure that's not the problem? because you have no predicate in the query you're comparing to. It's important to be clear what you're comparing. > DataFrame count method return wrong size. > - > > Key: SPARK-12741 > URL: https://issues.apache.org/jira/browse/SPARK-12741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Sasi > > Hi, > I'm updating my report. > I'm working with Spark 1.5.2, (used to be 1.5.0), I have a DataFrame and I > have 2 method, one for collect data and other for count. > method doQuery looks like: > {code} > dataFrame.collect() > {code} > method doQueryCount looks like: > {code} > dataFrame.count() > {code} > I have few scenarios with few results: > 1) Non data exists on my NoSQLDatabase results: count 0 and collect() 0 > 2) 3 rows exists results: count 0 and collect 3. > 3) 5 rows exists results: count 2 and collect 5. > I tried to change the count code to the below code, but got the same results > as I mentioned above. 
> {code} > dataFrame.sql("select count(*) from tbl").count/collect[0] > {code} > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110704#comment-15110704 ] dileep commented on SPARK-12843: public class JavaSparkSQL { public static class Person implements Serializable { private String name; private int age; public String getName() { return name; } public void setName(String name) { this.name = name; } public int getAge() { return age; } public void setAge(int age) { this.age = age; } } public static void main(String[] args) throws Exception { long millis1 = System.currentTimeMillis() % 1000; SparkConf sparkConf = new SparkConf().setAppName("JavaSparkSQL").setMaster("local[4]"); JavaSparkContext ctx = new JavaSparkContext(sparkConf); SQLContext sqlContext = new SQLContext(ctx); // Load a text file and convert each line to a Java Bean. JavaRDD<Person> people = ctx.textFile("/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people_1.txt").map( new Function<String, Person>() { @Override public Person call(String line) { String[] parts = line.split(","); Person person = new Person(); person.setName(parts[0]); person.setAge(Integer.parseInt(parts[1].trim())); return person; } }); // Apply a schema to an RDD of Java Beans and register it as a table. DataFrame schemaPeople = sqlContext.createDataFrame(people, Person.class); schemaPeople.registerTempTable("people"); // SQL can be run over RDDs that have been registered as tables. //DataFrame teenagers = sqlContext.sql("SELECT age, name FROM people WHERE age >= 13 AND age <= 19"); //DataFrame teenagers = sqlContext.sql("SELECT * FROM people"); DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); // The results of SQL queries are DataFrames and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal.
List<People> teenagerNames = teenagers.toJavaRDD().map(new Function<Row, People>() { @Override public People call(Row row) { long millis2 = System.currentTimeMillis() % 1000; People people = new People(); people.setAge(row.getInt(0)); people.setName(row.getString(1)); //System.out.println(people.toString()); return people; } }).collect(); long millis2 = System.currentTimeMillis() % 1000; long millis3 = millis2 - millis1; System.out.println("difference = " + String.valueOf(millis3)); /* for (String name: teenagerNames) { System.out.println("=>"+name); } */ /* System.out.println("=== Data source: Parquet File ==="); // DataFrames can be saved as parquet files, maintaining the schema information. schemaPeople.write().parquet("people.parquet"); // Read in the parquet file created above. // Parquet files are self-describing so the schema is preserved. // The result of loading a parquet file is also a DataFrame. DataFrame parquetFile = sqlContext.read().parquet("people.parquet"); //Parquet files can also be registered as tables and then used in SQL statements. parquetFile.registerTempTable("parquetFile"); DataFrame teenagers2 = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19"); teenagerNames = teenagers2.toJavaRDD().map(new Function<Row, String>() { @Override public String call(Row row) { return "Name: " + row.getString(0); } }).collect(); for (String name: teenagerNames) { System.out.println(name); } System.out.println("=== Data source: JSON Dataset ==="); // A JSON dataset is pointed by path. // The path can be either a single text file or a directory storing text files. String path = "/home/394036/spark-1.6.0-bin-hadoop2.3/examples/src/main/resources/people.json"; // Create a DataFrame from the file(s) pointed by path DataFrame peopleFromJsonFile = sqlContext.read().json(path); // Because the schema of a JSON dataset is automatically inferred, to write queries, // it is better to take a look at what is the schema. peopleFromJsonFile.printSchema(); // The schema of people is ... // root // |-- age: IntegerType // |-- name: StringType // Register this DataFrame as a table. peopleFromJsonFile.registerTempTable("people"); // SQL statements can be run by using the sql methods provided by sqlContext. DataFrame teenagers3 = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19"); // The results of SQL queries are DataFrame and support all the normal RDD operations. // The columns of a row in the result can be accessed by ordinal.
[jira] [Issue Comment Deleted] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dileep updated SPARK-12843: --- Comment: was deleted (was: the JavaSparkSQL timing example quoted in full in the earlier comment on SPARK-12843)
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110772#comment-15110772 ] Andy Grove commented on SPARK-12932: After reviewing the code for this, I think it is just a case of changing the error message text in `org.apache.spark.sql.catalyst.JavaTypeInference#extractorFor`. The error should be changed to: {code} s"Cannot infer type for Java class ${other.getName} because it is not bean-compliant" {code} > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more useful error message to help Java > developers who are trying out the Dataset API for the first time. 
> The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
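The bean rules the report describes (a public no-arg constructor plus a setter for every getter) can be checked mechanically. Below is a rough Python stand-in for the kind of inference Spark performs via Java reflection; the `is_bean_compliant` helper and both classes are made up for illustration, and Python naming is used in place of Java conventions.

```python
def is_bean_compliant(cls):
    # Rule 1: instantiable with no arguments (the "default constructor").
    try:
        cls()
    except TypeError:
        return False
    # Rule 2: every getXxx must have a matching setXxx.
    getters = {n[3:] for n in dir(cls) if n.startswith("get") and len(n) > 3}
    setters = {n[3:] for n in dir(cls) if n.startswith("set") and len(n) > 3}
    return getters <= setters

class GetterOnlyPerson:
    def __init__(self, first="", age=0):
        self.first, self.age = first, age
    def getFirst(self):
        return self.first
    def getAge(self):
        return self.age

class BeanPerson(GetterOnlyPerson):
    def setFirst(self, v):
        self.first = v
    def setAge(self, v):
        self.age = v

print(is_bean_compliant(GetterOnlyPerson))  # False -- getters but no setters
print(is_bean_compliant(BeanPerson))        # True
```

A check like this, run before encoder inference, is what would let the error message name the offending class and say *why* it is not bean-compliant, as the proposed message text does.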
[jira] [Resolved] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12534. --- Resolution: Fixed Fix Version/s: 2.0.0 Resolved by https://github.com/apache/spark/pull/10491 > Document missing command line options to Spark properties mapping > - > > Key: SPARK-12534 > URL: https://issues.apache.org/jira/browse/SPARK-12534 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, YARN >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > > Several Spark properties equivalent to Spark submit command line options are > missing. > {quote} > The equivalent for spark-submit --num-executors should be > spark.executor.instances > When use in SparkConf? > http://spark.apache.org/docs/latest/running-on-yarn.html > Could you try setting that with sparkR.init()? > _ > From: Franc Carter> Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > Hi, > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > If I start sparkR with > sparkR --master yarn --num-executors 6 > then I get 6 executors > However, if start sparkR with > sparkR > followed by > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > then I only get 2 executors. > Can anyone point me in the direction of what I might doing wrong ? I need to > initialise this was so that rStudio can hook in to SparkR > thanks > -- > Franc > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
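The mapping the ticket asks to document can be captured as a simple lookup. The `--num-executors` to `spark.executor.instances` entry comes from the ticket itself; the helper and the other two entries are illustrative assumptions, not an exhaustive or authoritative list.

```python
# spark-submit CLI option -> equivalent Spark property.
# First entry is from the ticket; the rest are assumed standard mappings.
OPTION_TO_PROPERTY = {
    "--num-executors": "spark.executor.instances",
    "--executor-memory": "spark.executor.memory",
    "--executor-cores": "spark.executor.cores",
}

def property_for(option):
    """Return the Spark property equivalent to a spark-submit flag."""
    return OPTION_TO_PROPERTY[option]

print(property_for("--num-executors"))  # spark.executor.instances
```

This also explains the quoted SparkR confusion: `sparkEnvir=list(spark.num.executors='6')` sets a property Spark does not read, whereas `spark.executor.instances` is the documented equivalent of `--num-executors`.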
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110712#comment-15110712 ] dileep edited comment on SPARK-12843 at 1/21/16 2:59 PM: - It's a caching issue: while scanning the table, the DataFrame needs to be cached, so from the next scan onwards it won't take much time was (Author: dileep): Its a caching issue, while scanning the table need to cache the records, so from next onwards it wont take much time > Spark should avoid scanning all partitions when limit is set > > > Key: SPARK-12843 > URL: https://issues.apache.org/jira/browse/SPARK-12843 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > SQL Query: > {code} > select * from table limit 100 > {code} > forces Spark to scan all partitions even when the data is available at the > beginning of the scan. > This behaviour should be avoided: the scan should stop when enough data has been > collected. > Is it related to: [SPARK-9850] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110785#comment-15110785 ] Andy Grove commented on SPARK-12932: Here is a pull request to change the error message: https://github.com/apache/spark/pull/10865 > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. 
> The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12534: -- Assignee: Felix Cheung (was: Apache Spark) > Document missing command line options to Spark properties mapping > - > > Key: SPARK-12534 > URL: https://issues.apache.org/jira/browse/SPARK-12534 > Project: Spark > Issue Type: Bug > Components: Deploy, Documentation, YARN >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 2.0.0 > > > Several Spark properties equivalent to Spark submit command line options are > missing. > {quote} > The equivalent for spark-submit --num-executors should be > spark.executor.instances > When use in SparkConf? > http://spark.apache.org/docs/latest/running-on-yarn.html > Could you try setting that with sparkR.init()? > _ > From: Franc Carter> Sent: Friday, December 25, 2015 9:23 PM > Subject: number of executors in sparkR.init() > To: > Hi, > I'm having trouble working out how to get the number of executors set when > using sparkR.init(). > If I start sparkR with > sparkR --master yarn --num-executors 6 > then I get 6 executors > However, if start sparkR with > sparkR > followed by > sc <- sparkR.init(master="yarn-client", > sparkEnvir=list(spark.num.executors='6')) > then I only get 2 executors. > Can anyone point me in the direction of what I might doing wrong ? I need to > initialise this was so that rStudio can hook in to SparkR > thanks > -- > Franc > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12932: Assignee: Apache Spark > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Assignee: Apache Spark >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. 
> The original source is below. > {code:title=Example.java} > public class JavaDatasetExample { > public static void main(String[] args) throws Exception { > SparkConf sparkConf = new SparkConf() > .setAppName("Example") > .setMaster("local[*]"); > JavaSparkContext sc = new JavaSparkContext(sparkConf); > SQLContext sqlContext = new SQLContext(sc); > List people = ImmutableList.of( > new Person("Joe", "Bloggs", 21, "NY") > ); > Dataset dataset = sqlContext.createDataset(people, > Encoders.bean(Person.class)); > {code} > {code:title=Person.java} > class Person implements Serializable { > String first; > String last; > int age; > String state; > public Person() { > } > public Person(String first, String last, int age, String state) { > this.first = first; > this.last = last; > this.age = age; > this.state = state; > } > public String getFirst() { > return first; > } > public String getLast() { > return last; > } > public int getAge() { > return age; > } > public String getState() { > return state; > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12932: Assignee: (was: Apache Spark) > Bad error message with trying to create Dataset from RDD of Java objects that > are not bean-compliant > > > Key: SPARK-12932 > URL: https://issues.apache.org/jira/browse/SPARK-12932 > Project: Spark > Issue Type: Improvement > Components: Java API >Affects Versions: 1.6.0 > Environment: Ubuntu 15.10 / Java 8 >Reporter: Andy Grove >Priority: Minor > > When trying to create a Dataset from an RDD of Person (all using the Java > API), I got the error "java.lang.UnsupportedOperationException: no encoder > found for example_java.dataset.Person". This is not a very helpful error and > no other logging information was apparent to help troubleshoot this. > It turned out that the problem was that my Person class did not have a > default constructor and also did not have setter methods and that was the > root cause. > This JIRA is for implementing a more usful error message to help Java > developers who are trying out the Dataset API for the first time. > The full stack trace is: > {code} > Exception in thread "main" java.lang.UnsupportedOperationException: no > encoder found for example_java.common.Person > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$extractorFor(JavaTypeInference.scala:403) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.extractorsFor(JavaTypeInference.scala:314) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75) > at org.apache.spark.sql.Encoders$.bean(Encoder.scala:176) > at org.apache.spark.sql.Encoders.bean(Encoder.scala) > {code} > NOTE that if I do provide EITHER the default constructor OR the setters, but > not both, then I get a stack trace with much more useful information, but > omitting BOTH causes this issue. > The original source is below. 
> {code:title=Example.java}
> public class JavaDatasetExample {
>     public static void main(String[] args) throws Exception {
>         SparkConf sparkConf = new SparkConf()
>             .setAppName("Example")
>             .setMaster("local[*]");
>         JavaSparkContext sc = new JavaSparkContext(sparkConf);
>         SQLContext sqlContext = new SQLContext(sc);
>         List<Person> people = ImmutableList.of(
>             new Person("Joe", "Bloggs", 21, "NY")
>         );
>         Dataset<Person> dataset = sqlContext.createDataset(people, Encoders.bean(Person.class));
> {code}
> {code:title=Person.java}
> class Person implements Serializable {
>     String first;
>     String last;
>     int age;
>     String state;
>
>     public Person() {
>     }
>
>     public Person(String first, String last, int age, String state) {
>         this.first = first;
>         this.last = last;
>         this.age = age;
>         this.state = state;
>     }
>
>     public String getFirst() {
>         return first;
>     }
>
>     public String getLast() {
>         return last;
>     }
>
>     public int getAge() {
>         return age;
>     }
>
>     public String getState() {
>         return state;
>     }
> }
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12932) Bad error message with trying to create Dataset from RDD of Java objects that are not bean-compliant
[ https://issues.apache.org/jira/browse/SPARK-12932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110796#comment-15110796 ] Apache Spark commented on SPARK-12932: -- User 'andygrove' has created a pull request for this issue: https://github.com/apache/spark/pull/10865
> Bad error message with trying to create Dataset from RDD of Java objects that
> are not bean-compliant
>
> Key: SPARK-12932
> URL: https://issues.apache.org/jira/browse/SPARK-12932
> Project: Spark
> Issue Type: Improvement
> Components: Java API
> Affects Versions: 1.6.0
> Environment: Ubuntu 15.10 / Java 8
> Reporter: Andy Grove
> Priority: Minor
[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen commented on SPARK-12741: --- Wait, is this what you mean? "select count(*) ..." returns 1 row, which contains a number, which is the number of rows matching the predicate. count() returns 1 because there is 1 row in the result set. collect() collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
[jira] [Comment Edited] (SPARK-12741) DataFrame count method return wrong size.
[ https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110676#comment-15110676 ] Sean Owen edited comment on SPARK-12741 at 1/21/16 2:40 PM:
Wait, is this what you mean? {code} select count(*) ... {code} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
was (Author: srowen): Wait, is this what you mean? {{select count(*) ...}} returns 1 row, which contains a number, which is the number of rows matching the predicate. {{count()}} returns 1 because there is 1 row in the result set. {{collect()}} collects (an Array of) that number of rows. Those are different things. That seems to be your problem.
> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Sasi
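The distinction Sean describes can be mimicked without a cluster. The sketch below is plain Python, not Spark: it models the result set of {{select count(*)}} as a list of rows. Counting the rows of that result always gives 1; the actual table size is the value inside the single row.

```python
# Model of a result set: a list of rows, each row a tuple of column values.
table = [("a",), ("b",), ("c",), ("d",), ("e",)]  # 5 rows

# "select count(*) from tbl" produces ONE row holding the number 5.
count_star_result = [(len(table),)]

rows_in_result = len(count_star_result)  # analogous to DataFrame.count() on the aggregate
value_in_row = count_star_result[0][0]   # analogous to reading the value out of collect()[0]

print(rows_in_result)  # 1 - one row in the aggregate's result set
print(value_in_row)    # 5 - the actual number of rows in the table
```

So counting the aggregate query's result and reading the aggregate's value answer different questions, which matches the scenarios the reporter observed.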
[jira] [Comment Edited] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110709#comment-15110709 ] dileep edited comment on SPARK-12843 at 1/21/16 2:57 PM:
Please see the code snippet below. We need to make use of the DataFrame's caching mechanism:
{code}
DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1");
teenagers.cache();
{code}
This makes a significant improvement in the select query, so subsequent select queries won't scan the entire data.
was (Author: dileep): Please see the above Code. We need to make use of caching mechanism of the data frame. DataFrame teenagers = sqlContext.sql("SELECT * FROM people limit 1"); teenagers.cache(); Which is making significant improvement in the select query. So for subsequent select query it wont select the entire data
> Spark should avoid scanning all partitions when limit is set
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej BryĆski
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when the data is available at the
> beginning of the scan.
> This behaviour should be avoided and the scan should stop when enough data
> has been collected.
> Is it related to: [SPARK-9850] ?
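The behaviour the ticket asks for — stop reading partitions once the limit is satisfied — can be sketched with a lazy scan in plain Python (a stand-in for illustration, not Spark's executor model). The generator records which partitions were actually read; taking the first 3 rows never touches the third partition.

```python
import itertools

scanned = []

def scan_partitions(partitions):
    # Lazily yield rows partition by partition, recording each partition
    # index the moment that partition is actually read.
    for i, part in enumerate(partitions):
        scanned.append(i)
        yield from part

partitions = [[1, 2], [3, 4], [5, 6]]
first_three = list(itertools.islice(scan_partitions(partitions), 3))
print(first_three)  # [1, 2, 3]
print(scanned)      # [0, 1] - the third partition was never scanned
```

Because `islice` stops pulling from the generator after 3 rows, the scan ends early; an eager scan (building the full row list before applying the limit) would have read all three partitions.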
[jira] [Updated] (SPARK-12534) Document missing command line options to Spark properties mapping
[ https://issues.apache.org/jira/browse/SPARK-12534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12534: -- Issue Type: Improvement (was: Bug)
> Document missing command line options to Spark properties mapping
> -
>
> Key: SPARK-12534
> URL: https://issues.apache.org/jira/browse/SPARK-12534
> Project: Spark
> Issue Type: Improvement
> Components: Deploy, Documentation, YARN
> Affects Versions: 1.5.2
> Reporter: Felix Cheung
> Assignee: Felix Cheung
> Priority: Minor
> Fix For: 2.0.0
>
> Several Spark properties equivalent to Spark submit command line options are
> missing.
> {quote}
> The equivalent for spark-submit --num-executors should be
> spark.executor.instances
> When used in SparkConf?
> http://spark.apache.org/docs/latest/running-on-yarn.html
> Could you try setting that with sparkR.init()?
> _
> From: Franc Carter
> Sent: Friday, December 25, 2015 9:23 PM
> Subject: number of executors in sparkR.init()
> To:
> Hi,
> I'm having trouble working out how to get the number of executors set when
> using sparkR.init().
> If I start sparkR with
> sparkR --master yarn --num-executors 6
> then I get 6 executors.
> However, if I start sparkR with
> sparkR
> followed by
> sc <- sparkR.init(master="yarn-client", sparkEnvir=list(spark.num.executors='6'))
> then I only get 2 executors.
> Can anyone point me in the direction of what I might be doing wrong? I need
> to initialise it this way so that RStudio can hook in to SparkR.
> thanks
> --
> Franc
> {quote}
[jira] [Resolved] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context
[ https://issues.apache.org/jira/browse/SPARK-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5929. -- Resolution: Won't Fix
> Pyspark: Register a pip requirements file with spark_context
>
> Key: SPARK-5929
> URL: https://issues.apache.org/jira/browse/SPARK-5929
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: buckhx
> Priority: Minor
>
> I've been doing a lot of dependency work with shipping dependencies to
> workers, as it is non-trivial for me to have my workers include the proper
> dependencies in their own environments.
> To circumvent this, I added an addRequirementsFile() method that takes a pip
> requirements file, downloads the packages, repackages them to be registered
> with addPyFiles, and ships them to workers.
> Here is a comparison of what I've done on the Palantir fork
> https://github.com/buckheroux/spark/compare/palantir:master...master
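The reporter's addRequirementsFile() exists only in the linked fork, but its first step — reading a pip requirements file into a list of requirement specifiers before downloading and repackaging them for addPyFiles() — can be sketched in plain Python. `parse_requirements` here is a hypothetical helper, not part of PySpark or pip.

```python
def parse_requirements(lines):
    """Collect requirement specifiers, skipping blank lines and comments,
    the way a hypothetical addRequirementsFile() might before invoking
    pip to download and repackage each package for addPyFiles()."""
    reqs = []
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            reqs.append(line)
    return reqs

requirements_txt = """
# analytics deps
numpy==1.9.1
pandas>=0.15

requests
""".splitlines()

print(parse_requirements(requirements_txt))
# ['numpy==1.9.1', 'pandas>=0.15', 'requests']
```

Each resulting specifier would then be fed to pip to fetch an archive that can be shipped to workers with the existing `SparkContext.addPyFile` mechanism.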
[jira] [Resolved] (SPARK-5647) Output metrics do not show up for older hadoop versions (< 2.5)
[ https://issues.apache.org/jira/browse/SPARK-5647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5647. -- Resolution: Duplicate I think this is nearly moot for Spark 2.x, given that Hadoop support may get to 2.4+ or 2.5+ only. I also don't see movement on this in a year.
> Output metrics do not show up for older hadoop versions (< 2.5)
> ---
>
> Key: SPARK-5647
> URL: https://issues.apache.org/jira/browse/SPARK-5647
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Kostas Sakellis
> Priority: Critical
>
> Need to add output metrics for hadoop < 2.5.
[jira] [Resolved] (SPARK-4247) [SQL] use beeline execute "create table as" thriftserver is not use "hive" user ,but the new hdfs dir's owner is "hive"
[ https://issues.apache.org/jira/browse/SPARK-4247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4247. -- Resolution: Not A Problem I think this is stale anyway, but I think this is a question about Hive and how it sets up directories and permissions. > [SQL] use beeline execute "create table as" thriftserver is not use "hive" > user ,but the new hdfs dir's owner is "hive" > --- > > Key: SPARK-4247 > URL: https://issues.apache.org/jira/browse/SPARK-4247 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.1 > Environment: java: 1.7 > hadoop: 2.3.0-cdh5.0.0 > spark: branch-1.1 lastest > thriftserver with hive 0.12 > hive :0.13.1 > compile cmd: > sh make-distribution.sh --tgz -Phadoop-provided -Pyarn -DskipTests > -Dhadoop.version=2.3.0-cdh5.0.0 -Phive >Reporter: qiaohaijun > > thriftserver start cmd: > sudo -u ultraman sh start-thriftserver.sh > --- > beeline start cmd: > sh beeline -u jdbc:hive2://x.x.x.x:1 -n ultraman -p ** > --- > sql: > create table ultraman_tmp.test as select channel, subchannel from > custom.common_pc_pv where logdate >= '2014110210' and logdate <= '2014110210' > limit 10; > the hdfs dir is follow: > drwxr-xr-x - hive hdfs 0 2014-11-03 18:02 > /user/hive/warehouse/ultraman_tmp.db/test > > 2014-11-03 20:12:52,498 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(244)) - Trying to connect to metastore with > URI http://10.141.77.221:9083 > 2014-11-03 20:12:52,509 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(322)) - Waiting 1 seconds before next > connection attempt. > 2014-11-03 20:12:53,510 INFO [pool-10-thread-3] hive.metastore > (HiveMetaStoreClient.java:open(332)) - Connected to metastore. 
> 2014-11-03 20:12:53,899 ERROR [pool-10-thread-3] > server.SparkSQLOperationManager (Logging.scala:logError(96)) - Error > executing query: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > sourceviewfs://nsX/user/hive/datadir-tmp/hive_2014-11-03_20-12-43_561_4822588651544736505-2/-ext-1/_SUCCESS > to destination > /user/hive/warehouse/ultraman_tmp.db/litao_sparksql_test_9/_SUCCESS > at org.apache.hadoop.hive.ql.metadata.Hive.renameFile(Hive.java:2173) > at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2227) > at org.apache.hadoop.hive.ql.metadata.Table.copyFiles(Table.java:652) > at org.apache.hadoop.hive.ql.metadata.Hive.loadTable(Hive.java:1443) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result$lzycompute(InsertIntoHiveTable.scala:243) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.result(InsertIntoHiveTable.scala:171) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.execute(InsertIntoHiveTable.scala:162) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:360) > at > org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:360) > at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58) > at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:103) > at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:98) > at > org.apache.spark.sql.hive.thriftserver.server.SparkSQLOperationManager$$anon$1.run(SparkSQLOperationManager.scala:172) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:193) > at > org.apache.hive.service.cli.session.HiveSessionImpl.executeStatement(HiveSessionImpl.java:175) > at > org.apache.hive.service.cli.CLIService.executeStatement(CLIService.java:150) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:207) > at > 
org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1133) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1118) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:58) > at > org.apache.hive.service.auth.TUGIContainingProcessor$1.run(TUGIContainingProcessor.java:55) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:415) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1697) > at > org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:526) > at > org.apache.hive.service.auth.TUGIContainingProcessor.process(TUGIContainingProcessor.java:55) > at >
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110906#comment-15110906 ] Jose Martinez Poblete commented on SPARK-12941: --- Thanks, let us know if this can be worked out on 1.4
> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
> Reporter: Jose Martinez Poblete
>
> When exporting data from Spark to Oracle, string datatypes are translated to
> TEXT for Oracle, which leads to the following error:
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
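Spark's real fix lives in its Scala JDBC dialect layer, but the dialect-override pattern itself is simple. Below is a plain-Python stand-in (the names are illustrative, not Spark's API): a generic mapping emits TEXT for strings, and an Oracle dialect overrides just that entry with a VARCHAR2 type Oracle accepts.

```python
# Hypothetical stand-in for a JDBC dialect's type mapping; not Spark's real API.
GENERIC_DDL = {
    "StringType": "TEXT",      # rejected by Oracle (ORA-00902)
    "IntegerType": "INTEGER",
}

ORACLE_OVERRIDES = {
    "StringType": "VARCHAR2(255)",  # a string type Oracle does accept
}

def ddl_type(catalyst_type, overrides=None):
    """Return the DDL type name, letting a dialect override the generic mapping."""
    overrides = overrides or {}
    return overrides.get(catalyst_type, GENERIC_DDL[catalyst_type])

print(ddl_type("StringType"))                    # TEXT
print(ddl_type("StringType", ORACLE_OVERRIDES))  # VARCHAR2(255)
```

The design point is that only types the target database disagrees on need an override; everything else falls through to the generic mapping.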
[jira] [Created] (SPARK-12956) add spark.yarn.hdfs.home.directory property
PJ Fanning created SPARK-12956: -- Summary: add spark.yarn.hdfs.home.directory property Key: SPARK-12956 URL: https://issues.apache.org/jira/browse/SPARK-12956 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.6.0 Reporter: PJ Fanning
https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala uses the default home directory based on the Hadoop configuration. I have a use case where it would be useful to override this and to provide an explicit base path. If this seems like a generally useful config property, I can put together a pull request.
[jira] [Updated] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12946: -- Target Version/s: (was: 1.6.1) Fix Version/s: (was: 1.6.1)
> The SQL page is empty
> -
>
> Key: SPARK-12946
> URL: https://issues.apache.org/jira/browse/SPARK-12946
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.6.0
> Reporter: KaiXinXIaoLei
> Attachments: SQLpage.png
>
> I run a SQL query using "bin/spark-sql --master yarn". Then I open the UI,
> and find the SQL page is empty.
[jira] [Resolved] (SPARK-6137) G-Means clustering algorithm implementation
[ https://issues.apache.org/jira/browse/SPARK-6137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6137. -- Resolution: Won't Fix
> G-Means clustering algorithm implementation
> ---
>
> Key: SPARK-6137
> URL: https://issues.apache.org/jira/browse/SPARK-6137
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Denis Dus
> Priority: Minor
> Labels: clustering
>
> Would it be useful to implement the G-Means clustering algorithm based on
> K-Means?
> G-means is a powerful extension of k-means, which uses a test of cluster data
> normality to decide whether it is necessary to split the current cluster into
> two new ones.
> Its relative complexity (compared to k-Means) is O(K), where K is the maximum
> number of clusters.
> The original paper is by Greg Hamerly and Charles Elkan from University of
> California:
> [http://papers.nips.cc/paper/2526-learning-the-k-in-k-means.pdf]
> I also have a small prototype of this algorithm written in R (if anyone is
> interested in it).
[jira] [Resolved] (SPARK-6056) Unlimit offHeap memory use cause RM killing the container
[ https://issues.apache.org/jira/browse/SPARK-6056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6056. -- Resolution: Not A Problem
> Unlimit offHeap memory use cause RM killing the container
> -
>
> Key: SPARK-6056
> URL: https://issues.apache.org/jira/browse/SPARK-6056
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.1
> Reporter: SaintBacchus
>
> Whether or not you set `preferDirectBufs` or limit the number of threads,
> Spark cannot limit the use of offheap memory.
> At line 269 of the class 'AbstractNioByteChannel' in netty-4.0.23.Final,
> Netty allocates an offheap memory buffer of the same size as the heap buffer.
> So however many buffers you want to transfer, the same amount of offheap
> memory will be allocated.
> But once the allocated memory size reaches the overhead memory limit set in
> YARN, this executor will be killed.
> I wrote a simple code to test it:
> {code:title=test.scala|borderStyle=solid}
> import org.apache.spark.storage._
> import org.apache.spark._
> val bufferRdd = sc.makeRDD(0 to 10, 10).map(x=>new Array[Byte](10*1024*1024)).persist
> bufferRdd.count
> val part = bufferRdd.partitions(0)
> val sparkEnv = SparkEnv.get
> val blockMgr = sparkEnv.blockManager
> def test = {
>     val blockOption = blockMgr.get(RDDBlockId(bufferRdd.id, part.index))
>     val resultIt = blockOption.get.data.asInstanceOf[Iterator[Array[Byte]]]
>     val len = resultIt.map(_.length).sum
>     println(s"[${Thread.currentThread.getId}] get block length = $len")
> }
> def test_driver(count:Int, parallel:Int)(f: => Unit) = {
>     val tpool = new scala.concurrent.forkjoin.ForkJoinPool(parallel)
>     val taskSupport = new scala.collection.parallel.ForkJoinTaskSupport(tpool)
>     val parseq = (1 to count).par
>     parseq.tasksupport = taskSupport
>     parseq.foreach(x=>f)
>     tpool.shutdown
>     tpool.awaitTermination(100, java.util.concurrent.TimeUnit.SECONDS)
> }
> {code}
> progress:
> 1.
bin/spark-shell --master yarn-client --executor-cores 40 --num-executors 1
> 2. :load test.scala in spark-shell
> 3. use this command to watch the executor on the slave node
> {code}
> pid=$(jps|grep CoarseGrainedExecutorBackend |awk '{print $1}');top -b -p $pid|grep $pid
> {code}
> 4. test_driver(20,100)(test) in spark-shell
> 5. watch the output of the command on the slave node
> If multiple threads are used to get len, the physical memory will soon exceed
> the limit set by spark.yarn.executor.memoryOverhead
[jira] [Commented] (SPARK-4878) driverPropsFetcher causes spurious Akka disassociate errors
[ https://issues.apache.org/jira/browse/SPARK-4878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110865#comment-15110865 ] Sean Owen commented on SPARK-4878: -- I think this may be defunct anyway, but the code in question does still exist. I don't see these errors in the logs of tests, for example.
> driverPropsFetcher causes spurious Akka disassociate errors
> ---
>
> Key: SPARK-4878
> URL: https://issues.apache.org/jira/browse/SPARK-4878
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.2.0
> Reporter: Stephen Haberman
> Priority: Minor
>
> The dedicated Akka system for fetching driver properties seems fine, but it
> leads to very misleading "AssociationHandle$Disassociated", dead letter, etc.
> sorts of messages that can lead the user to believe something is wrong with
> the cluster.
> (E.g. personally I thought it was a Spark -rc1/-rc2 bug and spent awhile
> poking around until I saw in the code that driverPropsFetcher is
> purposefully/immediately shut down.)
> Is there any way to cleanly shut down that initial Akka system so that the
> driver doesn't log these errors?
[jira] [Commented] (SPARK-12760) inaccurate description for difference between local vs cluster mode in closure handling
[ https://issues.apache.org/jira/browse/SPARK-12760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110908#comment-15110908 ] Apache Spark commented on SPARK-12760: -- User 'mortada' has created a pull request for this issue: https://github.com/apache/spark/pull/10867
> inaccurate description for difference between local vs cluster mode in
> closure handling
> ---
>
> Key: SPARK-12760
> URL: https://issues.apache.org/jira/browse/SPARK-12760
> Project: Spark
> Issue Type: Bug
> Components: Documentation
> Reporter: Mortada Mehyar
> Priority: Minor
>
> In the spark documentation there's an example for illustrating how `local`
> and `cluster` mode can differ
> http://spark.apache.org/docs/latest/programming-guide.html#example
> " In local mode with a single JVM, the above code will sum the values within
> the RDD and store it in counter. This is because both the RDD and the
> variable counter are in the same memory space on the driver node."
> However the above doesn't seem to be true. Even in `local` mode it seems like
> the counter value should still be 0, because the variable will be summed up
> in the executor memory space, but the final value in the driver memory space
> is still 0. I tested this snippet and verified that in `local` mode the value
> is indeed still 0.
> Is the doc wrong or perhaps I'm missing something the doc is trying to say?
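The crux of the doc issue — whether the driver's counter is mutated depends on whether the closure runs against the driver's own variable or a serialized copy — can be illustrated without Spark. In the plain-Python sketch below (a stand-in for executor behaviour, not Spark itself), the captured state is round-tripped through pickle the way a shipped closure's captured variables would be, so the task increments a copy and the driver's counter stays 0.

```python
import pickle

counter = 0
data = [1, 2, 3, 4, 5]

def run_task(task, captured):
    # Emulate shipping a closure to an executor: the captured state is
    # serialized and deserialized, so the task mutates a *copy* of it.
    state = pickle.loads(pickle.dumps(captured))
    for x in task:
        state["counter"] += x
    return state["counter"]

executor_result = run_task(data, {"counter": counter})
print(counter)          # 0  - the "driver" variable was never touched
print(executor_result)  # 15 - the sum accumulated inside the "executor" copy
```

This mirrors why relying on closure-captured mutable variables is unsafe in Spark regardless of mode, and why accumulators exist for this purpose.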
[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110964#comment-15110964 ] Sean Owen commented on SPARK-12650: --- [~vanzin] is that the intended way to set this? if so it sounds like that's the resolution.
> No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
> -
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
> Issue Type: Bug
> Components: Spark Submit
> Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
> Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a spark job.
> Issue-
> If I use yarn-cluster, then the SparkSubmit does not Xmx itself at all,
> leading to the jvm taking a default heap which is relatively large. This
> causes a large amount of vmem to be taken, so that it is killed by yarn. This
> can be worked around by disabling Yarn's vmem check, but that is a hack.
> If I run it in yarn-client mode, it's fine as long as my container has enough
> space for the driver, which is manageable. But I feel that the utter lack of
> Xmx settings for what I believe is a very small jvm is a problem.
> I believe this was introduced with the fix for SPARK-3884
[jira] [Commented] (SPARK-12843) Spark should avoid scanning all partitions when limit is set
[ https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110836#comment-15110836 ] Maciej BryĆski commented on SPARK-12843: [~dileep] I think you miss the point of this Jira.
> Spark should avoid scanning all partitions when limit is set
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.0
> Reporter: Maciej BryĆski
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110853#comment-15110853 ] Sean Owen commented on SPARK-5629: -- Are all of the EC2 tickets becoming essentially "wont fix" as the support moves out of Spark itself? > Add spark-ec2 action to return info about an existing cluster > - > > Key: SPARK-5629 > URL: https://issues.apache.org/jira/browse/SPARK-5629 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > You can launch multiple clusters using spark-ec2. At some point, you might > just want to get some information about an existing cluster. > Use cases include: > * Wanting to check something about your cluster in the EC2 web console. > * Wanting to feed information about your cluster to another tool (e.g. as > described in [SPARK-5627]). > So, in addition to the [existing > actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: > * {{launch}} > * {{destroy}} > * {{login}} > * {{stop}} > * {{start}} > * {{get-master}} > * {{reboot-slaves}} > We add a new action, {{describe}}, which describes an existing cluster if > given a cluster name, and all clusters if not. > Some examples: > {code} > # describes all clusters launched by spark-ec2 > spark-ec2 describe > {code} > {code} > # describes cluster-1 > spark-ec2 describe cluster-1 > {code} > In combination with the proposal in [SPARK-5627]: > {code} > # describes cluster-3 in a machine-readable way (e.g. 
JSON) > spark-ec2 describe cluster-3 --machine-readable > {code} > Parallels in similar tools include: > * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju > * [{{starcluster > listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] > from MIT StarCluster
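A minimal sketch of what a {{describe}} action could do, assuming (as spark-ec2 does for its other actions) that cluster membership is encoded in security-group names like `cluster-1-master` / `cluster-1-slaves`. The helper name and data shapes here are hypothetical, not the actual spark_ec2.py code:

```python
def describe_clusters(instances, cluster=None):
    """Group EC2 instances into spark-ec2 clusters using the
    '<cluster>-master' / '<cluster>-slaves' security-group convention.
    With no cluster name given, describe every cluster found."""
    clusters = {}
    for inst in instances:
        for group in inst["groups"]:
            for suffix in ("-master", "-slaves"):
                if group.endswith(suffix):
                    name = group[: -len(suffix)]
                    role = suffix.lstrip("-")
                    clusters.setdefault(name, {}).setdefault(role, []).append(inst["id"])
    if cluster is not None:
        return {cluster: clusters.get(cluster, {})}
    return clusters

# Toy instance records standing in for an EC2 API response.
instances = [
    {"id": "i-1", "groups": ["cluster-1-master"]},
    {"id": "i-2", "groups": ["cluster-1-slaves"]},
    {"id": "i-3", "groups": ["cluster-2-master"]},
]
```

`describe_clusters(instances)` would list both clusters with their master and slave instances; `describe_clusters(instances, "cluster-2")` restricts the output to one cluster, matching the proposed CLI.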
[jira] [Resolved] (SPARK-6009) IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND ()
[ https://issues.apache.org/jira/browse/SPARK-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6009. -- Resolution: Duplicate > IllegalArgumentException thrown by TimSort when SQL ORDER BY RAND () > > > Key: SPARK-6009 > URL: https://issues.apache.org/jira/browse/SPARK-6009 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.0 > Environment: Centos 7, Hadoop 2.6.0, Hive 0.15.0 > java version "1.7.0_75" > OpenJDK Runtime Environment (rhel-2.5.4.2.el7_0-x86_64 u75-b13) > OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode) >Reporter: Paul Barber > > Running the following SparkSQL query over JDBC: > {noformat} >SELECT * > FROM FAA > WHERE Year >= 1998 AND Year <= 1999 > ORDER BY RAND () LIMIT 10 > {noformat} > This results in one or more workers throwing the following exception, with > variations for {{mergeLo}} and {{mergeHi}}. > {noformat} > :java.lang.IllegalArgumentException: Comparison method violates its > general contract! 
> - at java.util.TimSort.mergeHi(TimSort.java:868) > - at java.util.TimSort.mergeAt(TimSort.java:485) > - at java.util.TimSort.mergeCollapse(TimSort.java:410) > - at java.util.TimSort.sort(TimSort.java:214) > - at java.util.Arrays.sort(Arrays.java:727) > - at > org.spark-project.guava.common.collect.Ordering.leastOf(Ordering.java:708) > - at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37) > - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1138) > - at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1135) > - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > - at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601) > - at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) > - at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263) > - at org.apache.spark.rdd.RDD.iterator(RDD.scala:230) > - at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > - at org.apache.spark.scheduler.Task.run(Task.scala:56) > - at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) > - at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > - at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > - at java.lang.Thread.run(Thread.java:745) > {noformat} > We have tested with both Spark 1.2.0 and Spark 1.2.1 and have seen the same > error in both. The query sometimes succeeds, but fails more often than not. > Whilst this sounds similar to bugs 3032 and 3656, we believe it it is not the > same. > The {{ORDER BY RAND ()}} is using TimSort to produce the random ordering by > sorting a list of random values. 
Having spent some time looking at the issue > with jdb, it appears that the problem is triggered by the random values being > changed during the sort - the code which triggers this is in > {{sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Row.scala}} > - class RowOrdering, function compare, line 250 - where a new random number > is taken for the same row.
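The root cause described above (a sort key that changes between comparisons) and its usual fix can be illustrated outside Spark. This is a hedged Python sketch of the fix pattern, not Spark's actual RowOrdering code: materialize the random value once per row, then sort on that stable key, so the comparator can never violate TimSort's contract:

```python
import random

rows = [{"Year": 1998 + (i % 2), "id": i} for i in range(1000)]

# Broken pattern (what the report describes): drawing a fresh random
# number every time two rows are compared means the same row can order
# differently at different points in the sort -- Java's TimSort detects
# this and throws "Comparison method violates its general contract!".
#
# Fix pattern: draw the random key ONCE per row, then sort on the
# materialized, stable key.
keyed = [(random.random(), row) for row in rows]
keyed.sort(key=lambda kr: kr[0])
top10 = [row for _, row in keyed[:10]]   # ORDER BY RAND() LIMIT 10
```

Because each row's key is fixed before the sort begins, the ordering is a total order for the duration of the sort, which is exactly the contract TimSort requires.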
[jira] [Resolved] (SPARK-4171) StreamingContext.actorStream throws serializationError
[ https://issues.apache.org/jira/browse/SPARK-4171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4171. -- Resolution: Won't Fix I think this is obsolete now that the Akka actor bits are being removed. > StreamingContext.actorStream throws serializationError > -- > > Key: SPARK-4171 > URL: https://issues.apache.org/jira/browse/SPARK-4171 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0, 1.2.0 >Reporter: Shiti Saxena > > I encountered this issue when I was working on > https://issues.apache.org/jira/browse/SPARK-3872. > Running the following test case on v1.1.0 and the master > branch(v1.2.0-SNAPSHOT) throws a serialization error. > {noformat} > test("actor input stream") { > // Set up the streaming context and input streams > val ssc = new StreamingContext(conf, batchDuration) > val networkStream = ssc.actorStream[String](EchoActor.props, "TestActor", > // Had to pass the local value of port to prevent from closing over > entire scope > StorageLevel.MEMORY_AND_DISK) > println("created actor") > networkStream.print() > ssc.start() > Thread.sleep(3 * 1000) > println("started stream") > Thread.sleep(3*1000) > logInfo("Stopping server") > logInfo("Stopping context") > ssc.stop() > } > {noformat} > where EchoActor is defined as > {noformat} > class EchoActor extends Actor with ActorHelper { > override def receive = { > case message => sender ! message > } > } > object EchoActor { > def props: Props = Props(new EchoActor()) > } > {noformat} > The same code works with v1.0.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12955) Spark-HiveSQL: It fails when querying a nested structure
Gerardo Villarroel created SPARK-12955: -- Summary: Spark-HiveSQL: It fails when querying a nested structure Key: SPARK-12955 URL: https://issues.apache.org/jira/browse/SPARK-12955 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Environment: CDH-5.4.7-1.cdh5.4.7 Reporter: Gerardo Villarroel If you create a Hive table whose schema contains a nested structure and query it with Spark, the query fails. You can reproduce it with a structure similar to: { a : { a : { aa1: "subValue", aa2: 1 } } , b : "value" } An extended example is below: { "namespace": "AA", "type": "record", "name": "CreditRecord", "fields": [ { "doc": "string", "type": "string", "name": "aaa" }, { "doc": "string", "type": "string", "name": "bbb" }, { "doc": "string", "type": "string", "name": "l90" }, { "doc": "boolean, ", "type": ["boolean", "null" ], "name": "isSubjectDeceased" }, { "doc": "string", "type": "string", "name": "gender" }, { "doc": "string", "type": "string", "name": "dateOfBirth" }, { "doc": "array of Trade, ", "type": [ { "items": { "fields": [ { "doc": "int, ", "type": [ "int", "null" ], "name": "sequenceNumber" }, { "doc": "string", "type": "string", "name": "tradeKey" }, { "doc": "PHR,", "type": [ { "fields": [ { "doc": "string", "type": "string", "name": "rate" }, { "doc": "string", "type": "string", "name": "date" } ], "type": "record", "name": "PHR" }, "null" ], "name": "phrgt48DR" } ], "type": "record", "name": "Trade" }, "type": "array" }, "null" ], "name": "trades" } ] }
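For reference, the lookup such a query has to perform over a nested schema can be sketched in plain Python. This is illustrative only; `resolve_path` is a hypothetical helper, not Spark or Hive code:

```python
def resolve_path(record, path):
    """Resolve a dotted field path such as "a.a.aa1" against a nested
    record -- the resolution a query like `SELECT a.a.aa1 FROM tbl`
    performs. Returns None for a missing field instead of failing."""
    cur = record
    for field in path.split("."):
        if not isinstance(cur, dict) or field not in cur:
            return None
        cur = cur[field]
    return cur

# The structure from the report above.
row = {"a": {"a": {"aa1": "subValue", "aa2": 1}}, "b": "value"}
value = resolve_path(row, "a.a.aa1")   # "subValue"
```

The bug report is about exactly this step failing inside Spark-HiveSQL for doubly nested records, even though each level of the path is present in the schema.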
[jira] [Commented] (SPARK-5629) Add spark-ec2 action to return info about an existing cluster
[ https://issues.apache.org/jira/browse/SPARK-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111056#comment-15111056 ] Shivaram Venkataraman commented on SPARK-5629: -- Yes - though I think it's beneficial to see if the ticket is still valid and if it is, we can open a corresponding issue at github.com/amplab/spark-ec2. Then we can leave a marker here saying where this issue is being followed up. > Add spark-ec2 action to return info about an existing cluster > - > > Key: SPARK-5629 > URL: https://issues.apache.org/jira/browse/SPARK-5629 > Project: Spark > Issue Type: Improvement > Components: EC2 >Reporter: Nicholas Chammas >Priority: Minor > > You can launch multiple clusters using spark-ec2. At some point, you might > just want to get some information about an existing cluster. > Use cases include: > * Wanting to check something about your cluster in the EC2 web console. > * Wanting to feed information about your cluster to another tool (e.g. as > described in [SPARK-5627]). > So, in addition to the [existing > actions|https://github.com/apache/spark/blob/9b746f380869b54d673e3758ca5e4475f76c864a/ec2/spark_ec2.py#L115]: > * {{launch}} > * {{destroy}} > * {{login}} > * {{stop}} > * {{start}} > * {{get-master}} > * {{reboot-slaves}} > We add a new action, {{describe}}, which describes an existing cluster if > given a cluster name, and all clusters if not. > Some examples: > {code} > # describes all clusters launched by spark-ec2 > spark-ec2 describe > {code} > {code} > # describes cluster-1 > spark-ec2 describe cluster-1 > {code} > In combination with the proposal in [SPARK-5627]: > {code} > # describes cluster-3 in a machine-readable way (e.g.
JSON) > spark-ec2 describe cluster-3 --machine-readable > {code} > Parallels in similar tools include: > * [{{juju status}}|https://juju.ubuntu.com/docs/] from Ubuntu Juju > * [{{starcluster > listclusters}}|http://star.mit.edu/cluster/docs/latest/manual/getting_started.html?highlight=listclusters#logging-into-a-worker-node] > from MIT StarCluster
[jira] [Commented] (SPARK-1680) Clean up use of setExecutorEnvs in SparkConf
[ https://issues.apache.org/jira/browse/SPARK-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110979#comment-15110979 ] Apache Spark commented on SPARK-1680: - User 'weineran' has created a pull request for this issue: https://github.com/apache/spark/pull/10869 > Clean up use of setExecutorEnvs in SparkConf > - > > Key: SPARK-1680 > URL: https://issues.apache.org/jira/browse/SPARK-1680 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Thomas Graves >Priority: Blocker > Fix For: 1.1.0 > > > We should make this consistent between YARN and Standalone. Basically, YARN > mode should just use the executorEnvs from the Spark conf and not need > SPARK_YARN_USER_ENV.
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110994#comment-15110994 ] Mario Briggs commented on SPARK-12177: -- bq. If one uses the kafka v9 jar even when using the old consumer API, it can only work with a Kafka v9 broker. I tried it on a single system setup (v0.9 client talking to v0.8 server-side) and the consumers had a problem (old or new). The producers though worked fine. So you are right. So then we will have kafka-assembly and kafka-assembly-v09/new, each including their version of the kafka jars, right? (I guess you were all along thinking of 2 different assemblies, and I guessed the other way round. Duh, IRC might have been faster.) With the above confirmed, it automatically throws out 'The public API signatures (of KafkaUtils in the v0.9 subproject) are different and do not clash (with KafkaUtils in the original kafka subproject) and hence can be added to the existing (original kafka subproject) KafkaUtils class.' So the only thing left, it seems, is to use 'new' or a better term instead of 'v09', since we both agree on that. Great, and thanks Mark. How's the 'python/pyspark/streaming/kafka-v09(new).py' going? > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API in separate > classes in package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for more backward compatibility. Users will not need > to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.
[jira] [Created] (SPARK-12957) Derive and propagate data constraints in logical plan
Yin Huai created SPARK-12957: Summary: Derive and propagate data constraints in logical plan Key: SPARK-12957 URL: https://issues.apache.org/jira/browse/SPARK-12957 Project: Spark Issue Type: New Feature Components: SQL Reporter: Yin Huai Based on the semantics of a query plan, we can derive data constraints (e.g. if a filter defines {{a > 10}}, we know that the output data of this filter satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a framework to derive and propagate constraints in the logical plan, which can help us build more advanced optimizations.
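A toy sketch of the idea (hypothetical helpers, not the eventual Spark framework): derive constraints from a filter's predicate, then propagate them up through a projection that preserves only some columns:

```python
def derive_constraints(column, op, literal):
    """A comparison passes only for non-null input, so a filter on
    `a > 10` implies both `a > 10` and `a IS NOT NULL` for its output
    rows -- the example from the ticket."""
    return {(column, op, literal), (column, "IS NOT NULL", None)}

def propagate(constraints, output_columns):
    """Propagate constraints up through a projection: keep only the
    constraints whose column survives into the operator's output."""
    return {c for c in constraints if c[0] in output_columns}

filter_constraints = derive_constraints("a", ">", 10)
project_constraints = propagate(filter_constraints, {"a", "b"})
```

An optimizer holding `project_constraints` could, for example, drop a redundant `a IS NOT NULL` filter above the projection, which is the kind of "more advanced optimization" the ticket has in mind.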
[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111024#comment-15111024 ] Marcelo Vanzin commented on SPARK-12650: If it works it's a workaround; that's a pretty obscure legacy deprecated option, and we should have a more explicit alternative to it. > No means to specify Xmx settings for SparkSubmit in yarn-cluster mode > - > > Key: SPARK-12650 > URL: https://issues.apache.org/jira/browse/SPARK-12650 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.5.2 > Environment: Hadoop 2.6.0 >Reporter: John Vines > > Background- > I have an app master designed to do some work and then launch a spark job. > Issue- > If I use yarn-cluster, then the SparkSubmit does not Xmx itself at all, > leading to the jvm taking a default heap which is relatively large. This > causes a large amount of vmem to be taken, so that it is killed by yarn. This > can be worked around by disabling Yarn's vmem check, but that is a hack. > If I run it in yarn-client mode, it's fine as long as my container has enough > space for the driver, which is manageable. But I feel that the utter lack of > Xmx settings for what I believe is a very small jvm is a problem. > I believe this was introduced with the fix for SPARK-3884
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111044#comment-15111044 ] Dan Dutrow commented on SPARK-11045: +1 to Dibyendu's comment that "Being at spark-packages, many [people do] not even consider using it [and] use the whatever Receiver Based model which is documented with Spark." Having hit limitations in both of the Receiver and Direct APIs, it would have been nice to have been pointed to the availability of alternatives. > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to Apache Spark Project. > This Kafka consumer has been around for more than year and has matured over > the time . I see there are many adoptions of this package . I receive > positive feedbacks that this consumer gives better performance and fault > tolerant capabilities. > This is the primary intent of this JIRA to give community a better > alternative if they want to use Receiver Base model. > If this consumer make it to Spark Core, it will definitely see more adoption > and support from community and help many who still prefer the Receiver Based > model of Kafka Consumer. > I understand the Direct Stream is the consumer which can give Exact Once > semantics and uses Kafka Low Level API , which is good . But Direct Stream > has concerns around recovering checkpoint on driver code change . Application > developer need to manage their own offset which complex . Even if some one > does manages their own offset , it limits the parallelism Spark Streaming can > achieve. 
If someone wants more parallelism and want > spark.streaming.concurrentJobs more than 1 , you can no longer rely on > storing offset externally as you have no control which batch will run in > which sequence. > Furthermore , the Direct Stream has higher latency , as it fetch messages > form Kafka during RDD action . Also number of RDD partitions are limited to > topic partition . So unless your Kafka topic does not have enough partitions, > you have limited parallelism while RDD processing. > Due to above mentioned concerns , many people who does not want Exactly Once > semantics , still prefer Receiver based model. Unfortunately, when customer > fall back to KafkaUtil.CreateStream approach, which use Kafka High Level > Consumer, there are other issues around the reliability of Kafka High Level > API. Kafka High Level API is buggy and has serious issue around Consumer > Re-balance. Hence I do not think this is correct to advice people to use > KafkaUtil.CreateStream in production . > The better option presently is there is to use the Consumer from > spark-packages . It is is using Kafka Low Level Consumer API , store offset > in Zookeeper, and can recover from any failure . Below are few highlights of > this consumer .. > 1. It has a inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer , The Rate Limiting is done by modifying the size blocks > by controlling the size of messages pulled from Kafka. Whereas , in Spark the > Rate Limiting is done by controlling number of messages. The issue with > throttling by number of message is, if message size various, block size will > also vary . Let say your Kafka has messages for different sizes from 10KB to > 500 KB. Thus throttling by number of message can never give any deterministic > size of your block hence there is no guarantee that Memory Back-Pressure can > really take affect. > 3. 
This consumer is using Kafka low level API which gives better performance > than KafkaUtils.createStream based High Level API. > 4. This consumer can give end to end no data loss channel if enabled with WAL. > By accepting this low level kafka consumer from spark packages to apache > spark project , we will give community a better options for Kafka > connectivity both for Receiver less and Receiver based model.
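Point 2 above (throttling by bytes pulled rather than by message count) can be sketched as follows. This is an illustrative Python model, not the package's actual back-pressure code:

```python
def fill_block_by_bytes(message_sizes, max_bytes):
    """Throttle by total bytes pulled from Kafka: block size is bounded
    regardless of how much individual message sizes vary -- the
    behaviour the comment argues for."""
    block, total = [], 0
    for size in message_sizes:
        if total + size > max_bytes:
            break
        block.append(size)
        total += size
    return block, total

# Messages ranging from 10 KB to 500 KB: throttling by *count* (say, 100
# messages per block) could yield anywhere from ~1 MB to ~50 MB per block,
# while throttling by *bytes* caps every block at max_bytes.
sizes = [10_000, 500_000, 250_000, 10_000, 500_000]
block, total = fill_block_by_bytes(sizes, 600_000)
```

With a deterministic upper bound on block size, memory back-pressure can actually hold, which is the guarantee count-based throttling cannot give when message sizes vary.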
[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10911: Assignee: Zhuo Liu (was: Apache Spark) > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit.
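The failure mode here is general to managed runtimes: a non-daemon user thread keeps the process alive after the main work finishes. A minimal Python illustration of the same rule (Python's daemon flag plays the role that a forced `System.exit` plays in the JVM case):

```python
import threading
import time

def start_worker(daemon):
    """Start a long-running worker thread, like the user thread pool in
    the report. A non-daemon thread blocks process exit until it
    finishes; a daemon thread (or an explicit System.exit / os._exit on
    clean shutdown) lets the runtime terminate anyway."""
    t = threading.Thread(target=time.sleep, args=(3600,), daemon=daemon)
    t.start()
    return t

# With daemon=False this worker would keep the process alive for an hour
# after main() returns -- the "executor stuck for days" scenario.
worker = start_worker(daemon=True)
```

Forcing `System.exit` on clean shutdown is the JVM-side equivalent of not letting stray user threads veto process termination.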
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111206#comment-15111206 ] Dibyendu Bhattacharya commented on SPARK-11045: --- Thanks Dan for your comments. Many have told me the same, and as you can see a large number of people have voted for this consumer to be included in Spark Core. I wish everyone who voted for it and is using it would also comment with their opinion. Unfortunately the Spark committers think otherwise. Spark still documents the faulty Receiver-based model on its website, which has issues, and there are many who need an alternative to Direct Stream but are reluctant to use a spark-packages library, so they go ahead and use whatever is mentioned on the Spark website. This seems to me to misguide people and force them to use a buggy consumer despite better alternatives existing. > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of making the Receiver based Low Level > Kafka Consumer from spark-packages > (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) to be > contributed back to Apache Spark Project. > This Kafka consumer has been around for more than year and has matured over > the time . I see there are many adoptions of this package . I receive > positive feedbacks that this consumer gives better performance and fault > tolerant capabilities. > This is the primary intent of this JIRA to give community a better > alternative if they want to use Receiver Base model. > If this consumer make it to Spark Core, it will definitely see more adoption > and support from community and help many who still prefer the Receiver Based > model of Kafka Consumer.
> I understand the Direct Stream is the consumer which can give Exact Once > semantics and uses Kafka Low Level API , which is good . But Direct Stream > has concerns around recovering checkpoint on driver code change . Application > developer need to manage their own offset which complex . Even if some one > does manages their own offset , it limits the parallelism Spark Streaming can > achieve. If someone wants more parallelism and want > spark.streaming.concurrentJobs more than 1 , you can no longer rely on > storing offset externally as you have no control which batch will run in > which sequence. > Furthermore , the Direct Stream has higher latency , as it fetch messages > form Kafka during RDD action . Also number of RDD partitions are limited to > topic partition . So unless your Kafka topic does not have enough partitions, > you have limited parallelism while RDD processing. > Due to above mentioned concerns , many people who does not want Exactly Once > semantics , still prefer Receiver based model. Unfortunately, when customer > fall back to KafkaUtil.CreateStream approach, which use Kafka High Level > Consumer, there are other issues around the reliability of Kafka High Level > API. Kafka High Level API is buggy and has serious issue around Consumer > Re-balance. Hence I do not think this is correct to advice people to use > KafkaUtil.CreateStream in production . > The better option presently is there is to use the Consumer from > spark-packages . It is is using Kafka Low Level Consumer API , store offset > in Zookeeper, and can recover from any failure . Below are few highlights of > this consumer .. > 1. It has a inbuilt PID Controller for dynamic rate limiting. > 2. In this consumer , The Rate Limiting is done by modifying the size blocks > by controlling the size of messages pulled from Kafka. Whereas , in Spark the > Rate Limiting is done by controlling number of messages. 
The issue with > throttling by number of message is, if message size various, block size will > also vary . Let say your Kafka has messages for different sizes from 10KB to > 500 KB. Thus throttling by number of message can never give any deterministic > size of your block hence there is no guarantee that Memory Back-Pressure can > really take affect. > 3. This consumer is using Kafka low level API which gives better performance > than KafkaUtils.createStream based High Level API. > 4. This consumer can give end to end no data loss channel if enabled with WAL. > By accepting this low level kafka consumer from spark packages to apache > spark project , we will give community a better options for Kafka > connectivity both for Receiver less and Receiver based model.
[jira] [Updated] (SPARK-12797) Aggregation without grouping keys
[ https://issues.apache.org/jira/browse/SPARK-12797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12797: -- Assignee: Davies Liu > Aggregation without grouping keys > - > > Key: SPARK-12797 > URL: https://issues.apache.org/jira/browse/SPARK-12797 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.0 > >
[jira] [Updated] (SPARK-12953) RDDRelation should set a write mode to avoid the error "pair.parquet already exists"
[ https://issues.apache.org/jira/browse/SPARK-12953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12953: -- Priority: Minor (was: Major) Fix Version/s: (was: 1.6.1) Issue Type: Improvement (was: Wish) [~shijinkui] don't set Fix version > RDDRelation write set mode will be better to avoid error "pair.parquet > already exists" > -- > > Key: SPARK-12953 > URL: https://issues.apache.org/jira/browse/SPARK-12953 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: shijinkui >Priority: Minor > > It will be error if not set Write Mode when execute test case > `RDDRelation.main()` > Exception in thread "main" org.apache.spark.sql.AnalysisException: path > file:/Users/sjk/pair.parquet already exists.; > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:76) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139) > at > 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:329) > at net.pusuo.gs.sql.RDDRelation$.main(RDDRelation.scala:65) > at net.pusuo.gs.sql.RDDRelation.main(RDDRelation.scala)
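The improvement being asked for is simply to set a save mode before writing, e.g. `df.write.mode("overwrite").parquet(path)`, so re-running the example does not hit the "already exists" error. The mode semantics can be sketched in plain Python (a single-file stand-in for DataFrameWriter; `save_text` is a hypothetical helper):

```python
import os
import tempfile

def save_text(path, data, mode="error"):
    """Mimic DataFrameWriter.mode() semantics for a single file:
    "error" raises if the path exists (the default the example hit),
    "overwrite" replaces it, "ignore" silently skips the write."""
    if os.path.exists(path):
        if mode == "error":
            raise FileExistsError(f"path {path} already exists")
        if mode == "ignore":
            return
    with open(path, "w") as f:
        f.write(data)

target = os.path.join(tempfile.mkdtemp(), "pair.parquet")
save_text(target, "first")
save_text(target, "second", mode="overwrite")   # safe on re-run
```

With the default "error" mode, the second call would raise exactly the way `InsertIntoHadoopFsRelation` does in the stack trace above; making the example pass an explicit mode is what the ticket proposes.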
[jira] [Updated] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception
[ https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12945: -- Component/s: Web UI > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > -- > > Key: SPARK-12945 > URL: https://issues.apache.org/jira/browse/SPARK-12945 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 > Environment: Linux, yarn-client >Reporter: Tristan >Priority: Minor > > Seeing this a lot; not sure if it is a problem or spurious error (I recall > this was an ignorable issue in previous version). The UI seems to be working > fine: > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at > org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) > at > org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[jira] [Resolved] (SPARK-6034) DESCRIBE EXTENDED viewname is not supported for HiveContext
[ https://issues.apache.org/jira/browse/SPARK-6034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6034. -- Resolution: Won't Fix > DESCRIBE EXTENDED viewname is not supported for HiveContext > --- > > Key: SPARK-6034 > URL: https://issues.apache.org/jira/browse/SPARK-6034 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Hao > > In HiveContext, describe extended [table | view | cache | datasource table] > is not well supported -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9282) Filter on Spark DataFrame with multiple columns
[ https://issues.apache.org/jira/browse/SPARK-9282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9282. -- Resolution: Not A Problem > Filter on Spark DataFrame with multiple columns > --- > > Key: SPARK-9282 > URL: https://issues.apache.org/jira/browse/SPARK-9282 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell, SQL >Affects Versions: 1.3.0 > Environment: CDH 5.0 on CentOS6 >Reporter: Sandeep Pal > > Filter on dataframe does not work if we have more than one column inside the filter. Nonetheless, it works on an RDD. > Following is the example:
> df1.show()
> age coolid depid empname
> 23   7      1     sandeep
> 21   8      2     john
> 24   9      1     cena
> 45  12      3     bob
> 20   7      4     tanay
> 12   8      5     gaurav
> df1.filter(df1.age > 21 and df1.age < 45).show(10)
> 23   7      1     sandeep
> 21   8      2     john <-
> 24   9      1     cena
> 20   7      4     tanay <-
> 12   8      5     gaurav <-- -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
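For context on why the filter above misbehaves: Python's `and` cannot be overloaded, so `df1.age > 21 and df1.age < 45` silently reduces to just one of the two conditions instead of building a combined predicate; the supported spelling is `(df1.age > 21) & (df1.age < 45)`. The toy class below is not Spark's actual `Column` implementation, only a minimal sketch of the mechanics (recent PySpark versions additionally raise an error when a `Column` is used in a boolean context):

```python
# Toy stand-in for a Spark SQL Column; not Spark's real class, just enough
# to show why Python's `and` breaks multi-column filters.
class Column(object):
    def __init__(self, expr):
        self.expr = expr

    def __gt__(self, other):
        return Column("(%s > %s)" % (self.expr, other))

    def __lt__(self, other):
        return Column("(%s < %s)" % (self.expr, other))

    def __and__(self, other):  # the `&` operator CAN be overloaded
        return Column("(%s AND %s)" % (self.expr, other.expr))


age = Column("age")

# `and` cannot be overloaded: bool(left) is truthy by default, so the whole
# expression silently evaluates to just the right-hand operand.
broken = (age > 21) and (age < 45)
print(broken.expr)  # only the second condition survives

# `&` builds the combined predicate, which is why Spark requires it here.
correct = (age > 21) & (age < 45)
print(correct.expr)
```

This matches the reported output: only `age < 45` was applied, so the rows marked with arrows wrongly appear.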
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111243#comment-15111243 ] Saisai Shao commented on SPARK-11045: - Hi [~dibbhatt], I'm afraid I cannot agree with your comment about the current receiver-based Kafka connector. Currently Spark Streaming uses the high-level API to fetch data; beyond calling the Kafka-provided API, Spark's own part is quite simple. If you're saying this approach is buggy and faulty, I think the bugs and faults are in the Kafka high-level API, not in the way Spark Streaming uses it. What you're doing is fixing things that should be fixed by Kafka, not Spark. Yes, we could improve the current receiver-based way, but building up from the ground to fix issues that will also be improved in Kafka will make it diverge from Kafka's way. Just my two cents :). > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya > > This JIRA is to track the progress of contributing the Receiver-based Low Level Kafka Consumer from spark-packages (http://spark-packages.org/package/dibbhatt/kafka-spark-consumer) back to the Apache Spark project. > This Kafka consumer has been around for more than a year and has matured over time. I see there are many adoptions of this package, and I receive positive feedback that this consumer gives better performance and fault-tolerance capabilities. > The primary intent of this JIRA is to give the community a better alternative if they want to use the Receiver-based model. > If this consumer makes it into Spark Core, it will definitely see more adoption and support from the community, and help the many who still prefer the Receiver-based model of Kafka consumer. 
> I understand the Direct Stream is the consumer that can give Exactly-Once semantics and uses the Kafka low-level API, which is good. But the Direct Stream has concerns around recovering checkpoints on driver code change. Application developers need to manage their own offsets, which is complex. Even if someone does manage their own offsets, it limits the parallelism Spark Streaming can achieve: if someone wants more parallelism and sets spark.streaming.concurrentJobs to more than 1, you can no longer rely on storing offsets externally, as you have no control over which batch will run in which sequence. > Furthermore, the Direct Stream has higher latency, as it fetches messages from Kafka during the RDD action. Also, the number of RDD partitions is limited to the number of topic partitions, so unless your Kafka topic has enough partitions, you have limited parallelism during RDD processing. > Due to the above concerns, many people who do not need Exactly-Once semantics still prefer the Receiver-based model. Unfortunately, when customers fall back to the KafkaUtils.createStream approach, which uses the Kafka High Level Consumer, there are other issues around the reliability of the Kafka high-level API. The Kafka high-level API is buggy and has a serious issue around consumer re-balance, hence I do not think it is correct to advise people to use KafkaUtils.createStream in production. > The better option presently is to use the consumer from spark-packages. It uses the Kafka Low Level Consumer API, stores offsets in ZooKeeper, and can recover from any failure. Below are a few highlights of this consumer: > 1. It has an inbuilt PID controller for dynamic rate limiting. > 2. In this consumer, rate limiting is done by controlling the block size, i.e. the number of bytes pulled from Kafka, whereas in Spark rate limiting is done by controlling the number of messages. The issue with throttling by number of messages is that if message sizes vary, block sizes will also vary. Say your Kafka topic has messages of different sizes, from 10 KB to 500 KB: throttling by number of messages can never give a deterministic block size, hence there is no guarantee that memory back-pressure can really take effect. > 3. This consumer uses the Kafka low-level API, which gives better performance than the KafkaUtils.createStream-based high-level API. > 4. This consumer can provide an end-to-end no-data-loss channel when enabled with the WAL. > By accepting this low-level Kafka consumer from spark-packages into the Apache Spark project, we will give the community better options for Kafka connectivity, for both the receiver-less and receiver-based models. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
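To make point 2 above concrete, here is a small illustrative sketch (plain Python with made-up message sizes; it is not code from either consumer) contrasting throttling by message count with throttling by bytes:

```python
import random

random.seed(0)
# Hypothetical payload sizes in the 10 KB - 500 KB range from the example above.
sizes = [random.randint(10000, 500000) for _ in range(1000)]


def blocks_by_count(msgs, max_msgs):
    """Throttle by number of messages: block byte-size drifts with payloads."""
    return [sum(msgs[i:i + max_msgs]) for i in range(0, len(msgs), max_msgs)]


def blocks_by_bytes(msgs, max_bytes):
    """Throttle by bytes pulled: every block stays under a deterministic cap."""
    blocks, current = [], 0
    for m in msgs:
        if current and current + m > max_bytes:
            blocks.append(current)  # close the block before it overflows
            current = 0
        current += m
    if current:
        blocks.append(current)
    return blocks


by_count = blocks_by_count(sizes, 50)        # 50 messages per block
by_bytes = blocks_by_bytes(sizes, 5000000)   # 5 MB cap per block
print("count-throttled block sizes spread over", max(by_count) - min(by_count), "bytes")
print("byte-throttled blocks never exceed", max(by_bytes), "bytes")
```

The byte-based variant gives memory back-pressure a hard bound to work with, while the count-based variant's block sizes swing with the payload mix.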
[jira] [Assigned] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10498: Assignee: (was: Apache Spark) > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111527#comment-15111527 ] Apache Spark commented on SPARK-10498: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/10871 > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10498) Add requirements file for create dev python tools
[ https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10498: Assignee: Apache Spark > Add requirements file for create dev python tools > - > > Key: SPARK-10498 > URL: https://issues.apache.org/jira/browse/SPARK-10498 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > > Minor since so few people use them, but it would probably be good to have a > requirements file for our python release tools for easier setup (also version > pinning). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12960) Some examples are missing support for python2
Mark Grover created SPARK-12960: --- Summary: Some examples are missing support for python2 Key: SPARK-12960 URL: https://issues.apache.org/jira/browse/SPARK-12960 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.6.0 Reporter: Mark Grover Priority: Minor Without importing print_function from __future__, lines later on like {code} print("Usage: direct_kafka_wordcount.py ", file=sys.stderr) {code} fail when using Python 2.x. Adding the import fixes that problem and doesn't break anything on Python 3 either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
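For reference, the fix amounts to a single `__future__` import at the top of each affected example (the usage string below is illustrative; the real scripts' argument lists are not shown in this report):

```python
# Must appear before any other statement (except the module docstring).
from __future__ import print_function

import sys

# Under Python 2 this line is a SyntaxError without the import above, because
# `print` is a statement there and accepts no `file=` keyword; with the import,
# `print` is a function on both Python 2 and Python 3.
print("Usage: direct_kafka_wordcount.py ...", file=sys.stderr)
```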
[jira] [Commented] (SPARK-11045) Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to Apache Spark Project
[ https://issues.apache.org/jira/browse/SPARK-11045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111398#comment-15111398 ] Cody Koeninger commented on SPARK-11045: There's already work being done on 0.9 https://issues.apache.org/jira/browse/SPARK-12177 > Contributing Receiver based Low Level Kafka Consumer from Spark-Packages to > Apache Spark Project > > > Key: SPARK-11045 > URL: https://issues.apache.org/jira/browse/SPARK-11045 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Dibyendu Bhattacharya -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111598#comment-15111598 ] Alex Bozarth commented on SPARK-12946: -- Can you give more details on this? Such as what commands did you run in the shell, what are your spark/yarn/hive configs, and how did you access the Web UI? > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12957) Derive and propagate data constrains in logical plan
[ https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111597#comment-15111597 ] Xiao Li commented on SPARK-12957: - I have two related PRs that require a general null filtering function: https://github.com/apache/spark/pull/10567 and https://github.com/apache/spark/pull/10566. Could you share your opinions on how to make it more general? > Derive and propagate data constrains in logical plan > - > > Key: SPARK-12957 > URL: https://issues.apache.org/jira/browse/SPARK-12957 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yin Huai > > Based on the semantics of a query plan, we can derive data constraints (e.g. if a filter defines {{a > 10}}, we know that the output data of this filter satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a framework to derive and propagate constraints in the logical plan, which can help us build more advanced optimizations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
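As a sketch of the idea behind this issue (plain Python, not Catalyst code; the helper name is made up): given a filter's predicates, such a framework could return both the predicates themselves and the null-rejection facts they imply, since a comparison can only hold when its attribute is non-null.

```python
def derive_constraints(predicates):
    """Given textual filter predicates such as 'a > 10', return them plus the
    implied null-rejection constraints (a comparison can only evaluate to true
    when its attribute is non-null)."""
    comparison_ops = (">=", "<=", ">", "<", "=")  # two-char operators first
    constraints = set(predicates)
    for pred in predicates:
        for op in comparison_ops:
            if op in pred:
                attribute = pred.split(op)[0].strip()
                constraints.add("%s IS NOT NULL" % attribute)
                break
    return constraints


# A filter of `a > 10` yields both the predicate and `a IS NOT NULL`,
# which downstream optimizations (e.g. null filtering) can then reuse.
print(sorted(derive_constraints(["a > 10"])))
```

A real implementation would of course work on expression trees rather than strings and would also propagate constraints through joins and projections.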
[jira] [Commented] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111645#comment-15111645 ] Alex Bozarth commented on SPARK-12946: -- These may be the same problem > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12946) The SQL page is empty
[ https://issues.apache.org/jira/browse/SPARK-12946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Bozarth updated SPARK-12946: - Comment: was deleted (was: These may be the same problem) > The SQL page is empty > - > > Key: SPARK-12946 > URL: https://issues.apache.org/jira/browse/SPARK-12946 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: KaiXinXIaoLei > Attachments: SQLpage.png > > > I run sql query using "bin/spark-sql --master yarn". Then i open the ui , > and find the SQL page is empty -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12859: Assignee: (was: Apache Spark) > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12859: Assignee: Apache Spark > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Assignee: Apache Spark >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12859) Names of input streams with receivers don't fit in Streaming page
[ https://issues.apache.org/jira/browse/SPARK-12859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111660#comment-15111660 ] Apache Spark commented on SPARK-12859: -- User 'ajbozarth' has created a pull request for this issue: https://github.com/apache/spark/pull/10873 > Names of input streams with receivers don't fit in Streaming page > - > > Key: SPARK-12859 > URL: https://issues.apache.org/jira/browse/SPARK-12859 > Project: Spark > Issue Type: Bug > Components: Streaming, Web UI >Affects Versions: 2.0.0 >Reporter: Jacek Laskowski >Priority: Trivial > Attachments: spark-streaming-webui-names-too-long-to-fit.png > > > Since the column for the names of input streams with receivers (under Input > Rate) is fixed, the not-so-long names don't fit in Streaming page. See the > attachment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12959) Silent switch to normal table writing when writing bucketed data with bucketing disabled
Xiao Li created SPARK-12959: --- Summary: Silent switch to normal table writing when writing bucketed data with bucketing disabled Key: SPARK-12959 URL: https://issues.apache.org/jira/browse/SPARK-12959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li When users turn off bucketing in SQLConf, we should issue a message telling users that these operations will be converted to a normal (non-bucketed) write. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
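A sketch of the requested behavior (hypothetical function and logger names, not Spark's actual writer code): check the session flag and log a warning before falling back to the normal write path.

```python
import logging

logging.basicConfig()


def write_table(table_is_bucketed, bucketing_enabled,
                log=logging.getLogger("datasources")):
    """Pick the write path; warn instead of silently downgrading."""
    if table_is_bucketed and not bucketing_enabled:
        # The suggested fix: surface the silent conversion to the user.
        log.warning("bucketing is disabled in SQLConf; "
                    "writing the data as a normal (non-bucketed) table.")
        return "normal"
    return "bucketed" if table_is_bucketed else "normal"


print(write_table(table_is_bucketed=True, bucketing_enabled=False))
```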
[jira] [Commented] (SPARK-9721) TreeTests.checkEqual should compare predictions on data
[ https://issues.apache.org/jira/browse/SPARK-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111579#comment-15111579 ] Seth Hendrickson commented on SPARK-9721: - I assume there is some motivation behind this JIRA, but given the current tree structure in spark.ml, I cannot think of a case where all splits and node predictions could be equal for two trees, but they do not predict exactly the same values for a given dataset. Is there a specific case this change would protect against? > TreeTests.checkEqual should compare predictions on data > --- > > Key: SPARK-9721 > URL: https://issues.apache.org/jira/browse/SPARK-9721 > Project: Spark > Issue Type: Test > Components: ML >Reporter: Joseph K. Bradley > > In spark.ml tree and ensemble unit tests: > Modify TreeTests.checkEqual in unit tests to compare predictions on the > training data, rather than only comparing the trees themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org