[jira] [Commented] (SPARK-14396) Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227838#comment-15227838
 ] 

Apache Spark commented on SPARK-14396:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12201

> Throw Exceptions for DDLs of Partitioned Views (CREATE VIEW and ALTER VIEW)
> ---
>
> Key: SPARK-14396
> URL: https://issues.apache.org/jira/browse/SPARK-14396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Because the concept of partitioning is associated with physical tables, we 
> disable all support for partitioned views, which are covered by the 
> following three commands in the [Hive DDL 
> Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView):
> {noformat}
> ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...];
> ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec;
> CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT 
> column_comment], ...) ]
>   [COMMENT view_comment]
>   [TBLPROPERTIES (property_name = property_value, ...)]
>   AS SELECT ...;
> {noformat}
>  
> An exception is thrown when users issue any of these three DDL commands.
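
A minimal sketch of how the new check could be exercised from PySpark, assuming 
the rejected DDL surfaces as an AnalysisException (the exact exception type 
depends on the pull request); {{some_view}} and the partition spec are 
placeholders:
{code}
# Hypothetical repro: issuing partitioned-view DDL should now fail fast.
try:
    sqlContext.sql("ALTER VIEW some_view ADD IF NOT EXISTS PARTITION (ds='2016-04-05')")
except Exception as e:
    print("partitioned-view DDL rejected: %s" % e)
{code}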



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14311) Model persistence in SparkR

2016-04-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227811#comment-15227811
 ] 

Yanbo Liang edited comment on SPARK-14311 at 4/6/16 6:34 AM:
-

While working on SPARK-14313, I found that we can easily support ml.save, but 
supporting ml.load is difficult in the current framework.
ml.load returns a single class type rather than a different type for each 
model. We could have ml.load return an S4 class "PipelineModel", but that does 
not fit the new SparkRWrapper framework, which wraps each MLlib model in its 
own S4 class.
I also found that the R h2o package uses a unified H2OModel object for all 
models, which makes save/load easier to support:
http://finzi.psych.upenn.edu/library/h2o/html/h2o.loadModel.html
Should we make ml.load return a unified S4 class "PipelineModel"? If so, we 
would first need to refactor so that the different model wrappers are defined 
as a unified S4 class "PipelineModel". But that looks like falling back to the 
old SparkRWrapper framework.
I'm looking forward to hearing your thoughts. [~mengxr] 


was (Author: yanboliang):
While working on SPARK-14313, I found that we can easily support ml.save, but 
supporting ml.load is difficult in the current framework.
ml.load returns a single class type rather than a different type for each 
model. We could have ml.load return an S4 class "PipelineModel", but that does 
not fit the new SparkRWrapper framework, which wraps each MLlib model in its 
own S4 class.
I also found that the R h2o package uses a unified H2OModel object for all 
models, which makes save/load easier to support:
http://finzi.psych.upenn.edu/library/h2o/html/h2o.loadModel.html
Should we make ml.load return a unified S4 class "PipelineModel"? If so, we 
would first need to refactor so that the different model wrappers are defined 
as a unified S4 class "PipelineModel". I'm looking forward to hearing your 
thoughts. [~mengxr] 

> Model persistence in SparkR
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load 
> is an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13857) Feature parity for ALS ML with MLLIB

2016-04-05 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227818#comment-15227818
 ] 

Nick Pentreath commented on SPARK-13857:


I was thinking the behaviour of {{transform}} could depend on the input params. 
By default, it takes a DF with cols {{userId}} and {{itemId}}, and outputs 
predictions for each {{user, item}} pair. If {{setK(10).setTopKCol("userId")}} 
is called, then it takes a DF with only one input col {{"userId"}} and outputs 
top-k predictions for each user (and vice versa for items).

The input DF would be different for each approach (the former is likely to be, 
say, the "test set" in an RMSE-style evaluation, while the latter is the set of 
userIds for ranking-style evaluation or for making the kind of predictions that 
are actually useful in practice).

In fact, the current {{transform}} is in practice only useful for evaluating 
RMSE. If users try to use this method for top-k-style predictions, it is very 
inefficient.

I will dig into evaluators a bit and see if there is a way to accommodate 
evaluation using the other methods {{recommendItems}} etc.
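
A minimal sketch of the two usages in PySpark terms; {{ratings}}, {{test_df}} 
and {{users_df}} are assumed DataFrames, and {{setK}}/{{setTopKCol}} are 
hypothetical names from this discussion rather than an existing API:
{code}
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating")
model = als.fit(ratings)

# Default behaviour: score every (userId, itemId) pair, e.g. for RMSE evaluation.
scored = model.transform(test_df)

# Proposed behaviour (hypothetical params): top-k recommendations per user,
# fed with a DF that has only a userId column.
# topk = model.setK(10).setTopKCol("userId").transform(users_df)
{code}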

> Feature parity for ALS ML with MLLIB
> 
>
> Key: SPARK-13857
> URL: https://issues.apache.org/jira/browse/SPARK-13857
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: Nick Pentreath
>
> Currently {{mllib.recommendation.MatrixFactorizationModel}} has methods 
> {{recommendProducts/recommendUsers}} for recommending top K to a given user / 
> item, as well as {{recommendProductsForUsers/recommendUsersForProducts}} to 
> recommend top K across all users/items.
> Additionally, SPARK-10802 is for adding the ability to do 
> {{recommendProductsForUsers}} for a subset of users (or vice versa).
> Look at exposing or porting (as appropriate) these methods to ALS in ML. 
> Investigate if efficiency can be improved at the same time (see SPARK-11968).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14311) Model persistence in SparkR

2016-04-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227811#comment-15227811
 ] 

Yanbo Liang commented on SPARK-14311:
-

While working on SPARK-14313, I found that we can easily support ml.save, but 
supporting ml.load is difficult in the current framework.
ml.load returns a single class type rather than a different type for each 
model. We could have ml.load return an S4 class "PipelineModel", but that does 
not fit the new SparkRWrapper framework, which wraps each MLlib model in its 
own S4 class.
I also found that the R h2o package uses a unified H2OModel object for all 
models, which makes save/load easier to support:
http://finzi.psych.upenn.edu/library/h2o/html/h2o.loadModel.html
Should we make ml.load return a unified S4 class "PipelineModel"? If so, we 
would first need to refactor so that the different model wrappers are defined 
as a unified S4 class "PipelineModel". I'm looking forward to hearing your 
thoughts. [~mengxr] 

> Model persistence in SparkR
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load 
> is an S3 method (correct me if I'm wrong).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14392:


Assignee: Apache Spark

> CountVectorizer Estimator should include binary toggle Param
> 
>
> Key: SPARK-14392
> URL: https://issues.apache.org/jira/browse/SPARK-14392
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
> contain it as well.
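
A short sketch of the proposed usage from PySpark; the {{binary}} argument on 
the Estimator is the addition this issue asks for (today the toggle exists only 
on {{CountVectorizerModel}}), and {{docs_df}} is an assumed DataFrame with a 
{{words}} array column:
{code}
from pyspark.ml.feature import CountVectorizer

# Hypothetical once the Param is added to the Estimator: binary=True would make
# every non-zero term count come out as 1.0 in the resulting vectors.
cv = CountVectorizer(inputCol="words", outputCol="features", binary=True)
model = cv.fit(docs_df)
{code}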



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227802#comment-15227802
 ] 

Apache Spark commented on SPARK-14392:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/12200

> CountVectorizer Estimator should include binary toggle Param
> 
>
> Key: SPARK-14392
> URL: https://issues.apache.org/jira/browse/SPARK-14392
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
> contain it as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14392:


Assignee: (was: Apache Spark)

> CountVectorizer Estimator should include binary toggle Param
> 
>
> Key: SPARK-14392
> URL: https://issues.apache.org/jira/browse/SPARK-14392
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
> contain it as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14426) Merge PerserUtils and ParseUtils

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14426:


Assignee: (was: Apache Spark)

> Merge PerserUtils and ParseUtils
> 
>
> Key: SPARK-14426
> URL: https://issues.apache.org/jira/browse/SPARK-14426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> We have ParserUtils and ParseUtils, which are both utility collections used 
> during parsing.
> Their names and purposes are very similar, so I think we can merge them.
> Also, the original unescapeSQLString method may have a fault: when "\u0061"-style 
> character literals are passed to it, they are not unescaped successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14426) Merge PerserUtils and ParseUtils

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227801#comment-15227801
 ] 

Apache Spark commented on SPARK-14426:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/12199

> Merge PerserUtils and ParseUtils
> 
>
> Key: SPARK-14426
> URL: https://issues.apache.org/jira/browse/SPARK-14426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>
> We have ParserUtils and ParseUtils, which are both utility collections used 
> during parsing.
> Their names and purposes are very similar, so I think we can merge them.
> Also, the original unescapeSQLString method may have a fault: when "\u0061"-style 
> character literals are passed to it, they are not unescaped successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14426) Merge PerserUtils and ParseUtils

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14426:


Assignee: Apache Spark

> Merge PerserUtils and ParseUtils
> 
>
> Key: SPARK-14426
> URL: https://issues.apache.org/jira/browse/SPARK-14426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> We have ParserUtils and ParseUtils, which are both utility collections used 
> during parsing.
> Their names and purposes are very similar, so I think we can merge them.
> Also, the original unescapeSQLString method may have a fault: when "\u0061"-style 
> character literals are passed to it, they are not unescaped successfully.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14412) spark.ml ALS prefered storage level Params

2016-04-05 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227799#comment-15227799
 ] 

Nick Pentreath commented on SPARK-14412:


Sure, go ahead

> spark.ml ALS prefered storage level Params
> --
>
> Key: SPARK-14412
> URL: https://issues.apache.org/jira/browse/SPARK-14412
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and 
> {{setFinalRDDStorageLevel}}.  Those should be added as Params in spark.ml, 
> but they should be in group "expertParam" since few users will need them.
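
A short sketch of what the spark.ml usage might look like from PySpark; the 
parameter names below are guesses modelled on the spark.mllib setters, not a 
final API:
{code}
from pyspark.ml.recommendation import ALS

# Hypothetical expert Params mirroring setIntermediateRDDStorageLevel /
# setFinalRDDStorageLevel from spark.mllib.
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          intermediateStorageLevel="MEMORY_AND_DISK",
          finalStorageLevel="MEMORY_AND_DISK")
{code}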



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-14410:
-

Assignee: Andrew Or

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>
> Right now, operations on existing functions in SessionCatalog do not really 
> check whether the function exists. We should add this check and avoid doing 
> it in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14426) Merge PerserUtils and ParseUtils

2016-04-05 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-14426:
--

 Summary: Merge PerserUtils and ParseUtils
 Key: SPARK-14426
 URL: https://issues.apache.org/jira/browse/SPARK-14426
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Kousuke Saruta


We have ParserUtils and ParseUtils, which are both utility collections used 
during parsing.

Their names and purposes are very similar, so I think we can merge them.

Also, the original unescapeSQLString method may have a fault: when "\u0061"-style 
character literals are passed to it, they are not unescaped successfully.
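
A minimal illustration of the suspected fault, written as a hypothetical 
PySpark repro; the literal below is what unescapeSQLString should turn into the 
single character 'a':
{code}
# Hypothetical repro: a \uXXXX escape inside a SQL string literal should be
# unescaped by the parser; per this report, it currently is not.
row = sqlContext.sql(r"SELECT '\u0061' AS c").first()
print(row.c)   # expected: 'a'
{code}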



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227794#comment-15227794
 ] 

Apache Spark commented on SPARK-14410:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/12198

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, operations on existing functions in SessionCatalog do not really 
> check whether the function exists. We should add this check and avoid doing 
> it in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) SQL/dataframe join error: mixes up the columns

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system, and for ALS Ratings user_id 
and product_id have to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for SQL

f_df.registerTempTable("f_df") #doing this only for SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being 
assigned to pageId. 



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for SQL

f_df.registerTempTable("f_df") #doing this only for SQL
{code}

Now I do the join and get the int versions of user_id and product_id as 
follows, I tried it with both dataFrame j

[jira] [Commented] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build as documented

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227773#comment-15227773
 ] 

Apache Spark commented on SPARK-14424:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/12197

> spark-class and related (spark-shell, etc.) no longer work with sbt build as 
> documented
> ---
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class, though everything depends on it) aren't yet ready 
> for this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "
> Follow-up: fix the docs and clarify that the assembly requirement has been 
> replaced with package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) SQL/dataframe join error: mixes up the columns

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for SQL

f_df.registerTempTable("f_df") #doing this only for SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being 
assigned to pageId. 



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as 
follows, I tried it wi

[jira] [Updated] (SPARK-14425) SQL/dataframe join error: mixes up the columns

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Summary: SQL/dataframe join error: mixes up the columns   (was: spark 
SQL/dataframe join error: mixes up the columns )

> SQL/dataframe join error: mixes up the columns 
> ---
>
> Key: SPARK-14425
> URL: https://issues.apache.org/jira/browse/SPARK-14425
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
> Environment: databricks cloud
>Reporter: venu k tangirala
>Priority: Blocker
>
> I am running this on databricks cloud.
> I am running a join operation and the result has the columns mixed up. 
> Here is an example:
> the original df:
> {code:none}
> >>>f_df.take(3)
> [Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
> productId=374695023, pageId=2617232),
>  Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
> productId=374694787, pageId=2617240),
>  Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
> productId=374694787, pageId=2617247)]
> {code}
> As I am trying to build a recommendation system and ALS Ratings requires 
> user_id and product_id to be int, I am mapping them as follows: 
> {code:none}
> # mapping string to int for visitors
> visitorId_toInt = f_df.map(lambda 
> x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
> # print visitorId_toInt.take(3)
> visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
> SQL
> # mapping long to int for products
> productId_toInt= f_df.map(lambda 
> x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
> # print productId_toInt.take(3)
> productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
> SQL
> f_df.registerTempTable("f_df") #doing this only for the SQL
> {code}
> Now I do the join to get the int versions of user_id and product_id as 
> follows. I tried both the DataFrame join and the SQL join; both have the same 
> error:
> {code:none}
> tmp = f_df\
> .join(visitorId_toInt, 
> f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
> .select(f_df["idSite"], f_df["servertimestamp"], 
> visitorId_toInt["int_visitorId"],\
>  f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
> f_df["pageId"] )
>   
> ratings_df = 
> tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
> .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
>  tmp["sessionId"],tmp["serverTimePretty"], 
> productId_toInt["int_productId"], tmp["pageId"] )
> {code}
> The SQL version:
> {code:none}
> ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
> sessionId, serverTimePretty, int_productId,  pageId \
> FROM f_df \
> INNER JOIN visitorId_toInt \
> ON f_df.visitorId = visitorId_toInt.visitorId \
> INNER JOIN productId_toInt \
> ON f_df.productId = productId_toInt.productId \
> ")
> {code}
> Here is the result of the join: 
> {code:none}
> >>>ratings_df.take(3)
> [Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
> sessionId=None, serverTimePretty=u'375989339', int_productId=1060, 
> pageId=100)]
> {code}
> The idSite in my dataset is 100 for all the rows, and somehow that is being 
> assigned to pageId. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes up the columns

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Summary: spark SQL/dataframe join error: mixes up the columns   (was: spark 
SQL/dataframe join error: mixes the columns up)

> spark SQL/dataframe join error: mixes up the columns 
> -
>
> Key: SPARK-14425
> URL: https://issues.apache.org/jira/browse/SPARK-14425
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
> Environment: databricks cloud
>Reporter: venu k tangirala
>Priority: Blocker
>
> I am running this on databricks cloud.
> I am running a join operation and the result has the columns mixed up. 
> Here is an example:
> the original df:
> {code:none}
> >>>f_df.take(3)
> [Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
> productId=374695023, pageId=2617232),
>  Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
> productId=374694787, pageId=2617240),
>  Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
> sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
> productId=374694787, pageId=2617247)]
> {code}
> As I am trying to build a recommendation system and ALS Ratings requires 
> user_id and product_id to be int, I am mapping them as follows: 
> {code:none}
> # mapping string to int for visitors
> visitorId_toInt = f_df.map(lambda 
> x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
> # print visitorId_toInt.take(3)
> visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
> SQL
> # mapping long to int for products
> productId_toInt= f_df.map(lambda 
> x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
> # print productId_toInt.take(3)
> productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
> SQL
> f_df.registerTempTable("f_df") #doing this only for the SQL
> {code}
> Now I do the join to get the int versions of user_id and product_id as 
> follows. I tried both the DataFrame join and the SQL join; both have the same 
> error:
> {code:none}
> tmp = f_df\
> .join(visitorId_toInt, 
> f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
> .select(f_df["idSite"], f_df["servertimestamp"], 
> visitorId_toInt["int_visitorId"],\
>  f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
> f_df["pageId"] )
>   
> ratings_df = 
> tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
> .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
>  tmp["sessionId"],tmp["serverTimePretty"], 
> productId_toInt["int_productId"], tmp["pageId"] )
> {code}
> The SQL version:
> {code:none}
> ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
> sessionId, serverTimePretty, int_productId,  pageId \
> FROM f_df \
> INNER JOIN visitorId_toInt \
> ON f_df.visitorId = visitorId_toInt.visitorId \
> INNER JOIN productId_toInt \
> ON f_df.productId = productId_toInt.productId \
> ")
> {code}
> Here is the result of the join: 
> {code:none}
> >>>ratings_df.take(3)
> [Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
> sessionId=None, serverTimePretty=u'375989339', int_productId=1060, 
> pageId=100)]
> {code}
> The idSite in my dataset is 100 for all the rows, and somehow that is being 
> assigned to pageId. 
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being 
assigned to pageId. 



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int vers

[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being 
assigned to pageId. 



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as 
follows, I 

[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being 
assigned to pageId. 



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
{code:none}

>>>f_df.take(3)
[Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:20:39 AM', 
productId=374695023, pageId=2617232),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:07 AM', 
productId=374694787, pageId=2617240),
 Row(idSite=100, servertimestamp=1455219299, visitorId=u'8391f66992d536e5', 
sessionId=725873, serverTimePretty=u'Feb 11, 2016 11:21:25 AM', 
productId=374694787, pageId=2617247)]
{code}
As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as 
follows, I 

[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up. 
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system and ALS Ratings requires 
user_id and product_id to be int, I am mapping them as follows: 
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join to get the int versions of user_id and product_id as 
follows. I tried both the DataFrame join and the SQL join; both have the same 
error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being
assigned to pageId.



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int

[jira] [Updated] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14410:
--
Summary: SessionCatalog needs to check function existence   (was: 
SessionCatalog needs to check database/table/function existence )

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, operations for an existing db/table/function in SessionCatalog do 
> not really check if the db/table/function exists. We should add this check 
> and avoid doing the check in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
{code:none}

>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]
{code}
As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as
follows. I tried it with both the DataFrame join and the SQL join; both have the
same error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:none}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being
assigned to pageId.



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the j

[jira] [Updated] (SPARK-14410) SessionCatalog needs to check function existence

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14410:
--
Description: Right now, operations on existing functions in SessionCatalog do not
really check if the function exists. We should add this check and avoid doing the
check in each command.  (was: Right now, operations for an existing
db/table/function in SessionCatalog do not really check if the db/table/function
exists. We should add this check and avoid of doing the check in command.)

> SessionCatalog needs to check function existence 
> -
>
> Key: SPARK-14410
> URL: https://issues.apache.org/jira/browse/SPARK-14410
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>
> Right now, operations on existing functions in SessionCatalog do not
> really check if the function exists. We should add this check and avoid
> doing the check in each command.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:py}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as
follows. I tried it with both the DataFrame join and the SQL join; both have the
same error:

{code:py}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:py}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:py}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being
assigned to pageId.



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL

Now I do the join and get the int versions of user_id and pro

[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Description: 
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:none}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int versions of user_id and product_id as
follows. I tried it with both the DataFrame join and the SQL join; both have the
same error:

{code:none}
tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )
{code}
The SQL version:

{code:none}
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")
{code}

Here is the result of the join: 

{code:py}
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]
{code}

The idSite in my dataset is 100 for all the rows, and somehow that is being
assigned to pageId.



 

  was:
I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
{code:py}

# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL
{code}

Now I do the join and get the int ver

[jira] [Assigned] (SPARK-14132) [Table related commands] Alter partition

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-14132:
-

Assignee: Andrew Or

> [Table related commands] Alter partition
> 
>
> Key: SPARK-14132
> URL: https://issues.apache.org/jira/browse/SPARK-14132
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>
> For the alter partition commands, we have the following tokens.
> TOK_ALTERTABLE_ADDPARTS
> TOK_ALTERTABLE_DROPPARTS
> TOK_ALTERTABLE_CLUSTER_SORT
> TOK_MSCK
> TOK_ALTERTABLE_ARCHIVE/TOK_ALTERTABLE_UNARCHIVE
> For data source tables, we should throw exceptions.
> For Hive tables, we should support them. For now, it should be fine to throw 
> an exception for TOK_ALTERTABLE_ARCHIVE/TOK_ALTERTABLE_UNARCHIVE.
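
For reference, a rough sketch of the Hive-style DDL statements these tokens correspond to, issued through sqlContext.sql; the table name {{sales}} and partition column {{dt}} are made up for illustration, and per this ticket the statements would throw for data source tables:

{code:py}
# Illustrative only: Hive-style DDL behind each token, on a hypothetical
# Hive table `sales` partitioned by `dt`.
stmts = [
    "ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt='2016-04-05')",   # TOK_ALTERTABLE_ADDPARTS
    "ALTER TABLE sales DROP IF EXISTS PARTITION (dt='2016-04-05')",      # TOK_ALTERTABLE_DROPPARTS
    "ALTER TABLE sales CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS", # TOK_ALTERTABLE_CLUSTER_SORT
    "MSCK REPAIR TABLE sales",                                           # TOK_MSCK
    "ALTER TABLE sales ARCHIVE PARTITION (dt='2016-04-05')",             # TOK_ALTERTABLE_ARCHIVE
]
for s in stmts:
    sqlContext.sql(s)
{code}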



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build as documented

2016-04-05 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14424:

Description: 
SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
(specifically spark-class but everything depends on it) aren't yet ready for 
this change.
Repro steps:
./build/sbt clean compile assembly && ./bin/spark-shell
Result:
"
Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
You need to build Spark before running this program.
"

Follow-up: fix the docs and clarify that the assembly requirement has been replaced
with package.


  was:
SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
(specifically spark-class but everything depends on it) aren't yet ready for 
this change.
Repro steps:
./build/sbt clean compile assembly && ./bin/spark-shell

Temporary workaround: revert SPARK-13579.
Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build instructions
in docs/building-spark.md if new build instructions are required).
Result:
"
Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
You need to build Spark before running this program.
"



> spark-class and related (spark-shell, etc.) no longer work with sbt build as 
> documented
> ---
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class but everything depends on it) aren't yet ready for 
> this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "
> Follow-up: fix the docs and clarify that the assembly requirement has been replaced
> with package.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venu k tangirala updated SPARK-14425:
-
Priority: Blocker  (was: Major)

> spark SQL/dataframe join error: mixes the columns up
> 
>
> Key: SPARK-14425
> URL: https://issues.apache.org/jira/browse/SPARK-14425
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.1
> Environment: databricks cloud
>Reporter: venu k tangirala
>Priority: Blocker
>
> I am running this on databricks cloud.
> I am running a join operation and the result has the columns mixed up.
> Here is an example:
> the original df:
> >>>df.take(3)
> [Row(idSite=u'100', servertimestamp=u'1455219299', 
> visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 
> 11, 2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
>  Row(idSite=u'100', servertimestamp=u'1455219299', 
> visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 
> 11, 2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
>  Row(idSite=u'100', servertimestamp=u'1455219299', 
> visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 
> 11, 2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]
> As I am trying to build a recommendation system, and ALS Ratings requires
> user_id and product_id to be int, I am mapping them as follows:
> # mapping string to int for visitors
> visitorId_toInt = f_df.map(lambda 
> x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
> # print visitorId_toInt.take(3)
> visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
> SQL
> # mapping long to int for products
> productId_toInt= f_df.map(lambda 
> x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
> # print productId_toInt.take(3)
> productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
> SQL
> f_df.registerTempTable("f_df") #doing this only for the SQL
> Now I do the join and get the int versions of user_id and product_id as
> follows. I tried it with both the DataFrame join and the SQL join; both have the
> same error:
> tmp = f_df\
> .join(visitorId_toInt, 
> f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
> .select(f_df["idSite"], f_df["servertimestamp"], 
> visitorId_toInt["int_visitorId"],\
>  f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
> f_df["pageId"] )
>   
> ratings_df = 
> tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
> .select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
>  tmp["sessionId"],tmp["serverTimePretty"], 
> productId_toInt["int_productId"], tmp["pageId"] )
> The SQL version:
> ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
> sessionId, serverTimePretty, int_productId,  pageId \
> FROM f_df \
> INNER JOIN visitorId_toInt \
> ON f_df.visitorId = visitorId_toInt.visitorId \
> INNER JOIN productId_toInt \
> ON f_df.productId = productId_toInt.productId \
> ")
> Here is the result of the join: 
> >>>ratings_df.take(3)
> [Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
> sessionId=None, serverTimePretty=u'377347895', int_productId=7936, 
> pageId=100),
>  Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
> sessionId=None, serverTimePretty=u'375989339', int_productId=1060, 
> pageId=100)]
> The idSite in my dataset is 100 for all the rows, and somehow that is being
> assigned to pageId.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build as documented

2016-04-05 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-14424:

Summary: spark-class and related (spark-shell, etc.) no longer work with 
sbt build as documented  (was: spark-class and related (spark-shell, etc.) no 
longer work with sbt build)

> spark-class and related (spark-shell, etc.) no longer work with sbt build as 
> documented
> ---
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class but everything depends on it) aren't yet ready for 
> this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Temporary workaround: revert SPARK-13579.
> Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build
> instructions in docs/building-spark.md if new build instructions are required).
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14425) spark SQL/dataframe join error: mixes the columns up

2016-04-05 Thread venu k tangirala (JIRA)
venu k tangirala created SPARK-14425:


 Summary: spark SQL/dataframe join error: mixes the columns up
 Key: SPARK-14425
 URL: https://issues.apache.org/jira/browse/SPARK-14425
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.1
 Environment: databricks cloud
Reporter: venu k tangirala


I am running this on databricks cloud.
I am running a join operation and the result has the columns mixed up.
Here is an example:

the original df:
>>>df.take(3)
[Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:20:39 AM', productId=u'374695023', pageId=u'2617232'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:07 AM', productId=u'374694787', pageId=u'2617240'),
 Row(idSite=u'100', servertimestamp=u'1455219299', 
visitorId=u'8391f66992d536e5', sessionId=u'725873', serverTimePretty=u'Feb 11, 
2016 11:21:25 AM', productId=u'374694787', pageId=u'2617247')]

As I am trying to build a recommendation system, and ALS Ratings requires
user_id and product_id to be int, I am mapping them as follows:
# mapping string to int for visitors
visitorId_toInt = f_df.map(lambda 
x:x["visitorId"]).distinct().zipWithUniqueId().toDF(schema=["visitorId","int_visitorId"])
# print visitorId_toInt.take(3)
visitorId_toInt.registerTempTable("visitorId_toInt") #doing this only for the 
SQL

# mapping long to int for products
productId_toInt= f_df.map(lambda 
x:x["productId"]).distinct().zipWithUniqueId().toDF(schema=["productId","int_productId"])
# print productId_toInt.take(3)
productId_toInt.registerTempTable("productId_toInt") #doing this only for the 
SQL

f_df.registerTempTable("f_df") #doing this only for the SQL

Now I do the join and get the int versions of user_id and product_id as
follows. I tried it with both the DataFrame join and the SQL join; both have the
same error:

tmp = f_df\
.join(visitorId_toInt, f_df["visitorId"]==visitorId_toInt["visitorId"],'inner')\
.select(f_df["idSite"], f_df["servertimestamp"], 
visitorId_toInt["int_visitorId"],\
 f_df["sessionId"],f_df["serverTimePretty"], f_df["productId"], 
f_df["pageId"] )
  
ratings_df = 
tmp.join(productId_toInt,tmp["productId"]==productId_toInt["productId"],'inner')\
.select(tmp["idSite"], tmp["servertimestamp"], tmp["int_visitorId"],\
 tmp["sessionId"],tmp["serverTimePretty"], 
productId_toInt["int_productId"], tmp["pageId"] )

The SQL version:
ratings_df = sqlContext.sql("SELECT idSite, servertimestamp, int_visitorId, 
sessionId, serverTimePretty, int_productId,  pageId \
FROM f_df \
INNER JOIN visitorId_toInt \
ON f_df.visitorId = visitorId_toInt.visitorId \
INNER JOIN productId_toInt \
ON f_df.productId = productId_toInt.productId \
")

Here is the result of the join: 
>>>ratings_df.take(3)
[Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453723983, servertimestamp=None, int_visitorId=5774, 
sessionId=None, serverTimePretty=u'377347895', int_productId=7936, pageId=100),
 Row(idSite=1453724668, servertimestamp=None, int_visitorId=4271, 
sessionId=None, serverTimePretty=u'375989339', int_productId=1060, pageId=100)]

The idSite in my dataset is 100 for all the rows, and somehow that is being
assigned to pageId.
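
One workaround worth trying (a sketch built on the DataFrames above, not a confirmed fix for this ticket): rename the lookup keys so that no two inputs of the join share a column name, then select by unambiguous names:

{code:py}
# Sketch of a possible workaround: make every column name unique before joining.
v_map = visitorId_toInt.withColumnRenamed("visitorId", "v_key")
p_map = productId_toInt.withColumnRenamed("productId", "p_key")

ratings_df = (f_df
    .join(v_map, f_df["visitorId"] == v_map["v_key"], "inner")
    .join(p_map, f_df["productId"] == p_map["p_key"], "inner")
    .select("idSite", "servertimestamp", "int_visitorId",
            "sessionId", "serverTimePretty", "int_productId", "pageId"))
{code}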



 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14252) Executors do not try to download remote cached blocks

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14252.
---
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Executors do not try to download remote cached blocks
> -
>
> Key: SPARK-14252
> URL: https://issues.apache.org/jira/browse/SPARK-14252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 
> includes SPARK-12817, which changed the caching code a bit to remove 
> duplication. But it seems to have removed the part where executors check 
> whether other executors contain the needed cached block.
> In 1.6, that was done by the call to {{BlockManager.get}} in 
> {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls 
> {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and 
> thus the executor never gets blocks that are cached by other executors,
> causing the blocks to be recomputed locally instead.
> I wrote a small program that shows this. In 1.6, running with 
> {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages 
> like these in the logs:
> {noformat}
> 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered 
> locally
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 
> from BlockManagerId(1, blah, 58831)
> 1
> {noformat}
> On 2.0, I get (almost) all the 10 partitions cached on *both* executors, 
> because once the second one fails to find a block locally it just recomputes 
> it and caches it. It never tries to download the block from the other 
> executor.  The log messages above, which still exist in the code, don't show 
> up anywhere.
> Here's the code I used for the above (trimmed of some other stuff from my 
> little test harness, so might not compile as is):
> {code}
> val sc = new SparkContext(conf)
> try {
>   val rdd = sc.parallelize(1 to 1000, 10)
>   rdd.cache()
>   rdd.count()
>   // Create a single task that will sleep and block, so that a particular 
> executor is busy.
>   // This should force future tasks to download cached data from that 
> executor.
>   println("Running sleep job..")
>   val thread =  new Thread(new Runnable() {
> override def run(): Unit = {
>   rdd.mapPartitionsWithIndex { (i, iter) =>
> if (i == 0) {
>   Thread.sleep(TimeUnit.MINUTES.toMillis(10))
> }
> iter
>   }.count()
> }
>   })
>   thread.setDaemon(true)
>   thread.start()
>   // Wait a few seconds to make sure the task is running (too lazy for 
> listeners)
>   println("Waiting for tasks to start...")
>   TimeUnit.SECONDS.sleep(10)
>   // Now run a job that will touch everything and should use the cached 
> data.
>   val cnt = rdd.map(_*2).count()
>   println(s"Counted $cnt elements.")
>   println("Killing sleep job.")
>   thread.interrupt()
>   thread.join()
> } finally {
>   sc.stop()
> }
> {code}
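
For completeness, a rough PySpark analogue of the same harness (an untested sketch, assuming an existing SparkContext {{sc}}):

{code:py}
import threading
import time

# Cache ten partitions and materialize them.
rdd = sc.parallelize(range(1, 1001), 10)
rdd.cache()
rdd.count()

def sleep_job():
    # Keep the task for partition 0 busy so its executor stays occupied.
    def hold_first_partition(index, it):
        if index == 0:
            time.sleep(10 * 60)
        return it
    rdd.mapPartitionsWithIndex(hold_first_partition).count()

t = threading.Thread(target=sleep_job)
t.daemon = True
t.start()

time.sleep(10)  # crude wait for the sleeping task to start
# This job should fetch the busy executor's cached partitions remotely
# rather than recompute them.
print("Counted %d elements." % rdd.map(lambda x: x * 2).count())
{code}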



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8338) Ganglia fails to start

2016-04-05 Thread Leonardo Apolonio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227757#comment-15227757
 ] 

Leonardo Apolonio commented on SPARK-8338:
--

I am also seeing the error that [~mikeyreilly] reported in 
spark-1.6.1-bin-hadoop2.6

> Ganglia fails to start
> --
>
> Key: SPARK-8338
> URL: https://issues.apache.org/jira/browse/SPARK-8338
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.4.0
>Reporter: Vladimir Vladimirov
>Assignee: Vladimir Vladimirov
>Priority: Minor
> Fix For: 1.5.0
>
>
> Exception
> {code}
> Starting httpd: httpd: Syntax error on line 154 of 
> /etc/httpd/conf/httpd.conf: Cannot load /etc/httpd/modules/mod_authz_core.so 
> into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object 
> file: No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14416) Add thread-safe comments for CoarseGrainedSchedulerBackend's fields

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-14416.
---
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0

> Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
> ---
>
> Key: SPARK-14416
> URL: https://issues.apache.org/jira/browse/SPARK-14416
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> While I was reviewing https://github.com/apache/spark/pull/12078, I found that
> most of CoarseGrainedSchedulerBackend's mutable fields don't have any
> comments about the thread-safety assumptions, and it's hard for people to figure
> out which parts of the code should be protected by the lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14416) Add thread-safe comments for CoarseGrainedSchedulerBackend's fields

2016-04-05 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-14416:
--
Target Version/s: 2.0.0

> Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
> ---
>
> Key: SPARK-14416
> URL: https://issues.apache.org/jira/browse/SPARK-14416
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.0
>
>
> While I was reviewing https://github.com/apache/spark/pull/12078, I found that
> most of CoarseGrainedSchedulerBackend's mutable fields don't have any
> comments about the thread-safety assumptions, and it's hard for people to figure
> out which parts of the code should be protected by the lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-05 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227743#comment-15227743
 ] 

Yong Tang commented on SPARK-14409:
---

[~mlnick] I can work on this issue if no one has started yet. Thanks.

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful
> for recommendation evaluation (and potentially in other settings).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.
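
For context, a minimal sketch of what the existing RDD-based API looks like in PySpark today (the per-user lists below are made up, and {{sc}} is assumed to be an existing SparkContext):

{code:py}
from pyspark.mllib.evaluation import RankingMetrics

# Each element: (predicted ranking, ground-truth relevant items) for one user.
prediction_and_labels = sc.parallelize([
    ([1, 2, 3, 4, 5], [1, 3, 6]),
    ([4, 1, 2], [2, 5]),
])

metrics = RankingMetrics(prediction_and_labels)
print(metrics.precisionAt(3))        # precision@3
print(metrics.meanAveragePrecision)  # MAP
print(metrics.ndcgAt(3))             # NDCG@3
{code}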



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-05 Thread Jason Piper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227722#comment-15227722
 ] 

Jason Piper commented on SPARK-14393:
-

If this isn't a bug, I guess it needs to be clear in the documentation what the
intended behaviour is here; I assumed that it would guarantee monotonically
increasing IDs in the end result.

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}
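
As a side note, one way to obtain IDs that stay sequential regardless of a downstream coalesce is to assign them from the RDD with zipWithIndex; a rough sketch (the column name row_id is illustrative, and sqlContext is assumed to exist):

{code:py}
# Sketch: assign a global, contiguous index before any coalesce/repartition.
df = sqlContext.range(10)
with_id = (df.rdd
             .zipWithIndex()
             .map(lambda pair: tuple(pair[0]) + (pair[1],))
             .toDF(df.columns + ["row_id"]))
with_id.coalesce(1).show()
{code}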



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14424:


Assignee: (was: Apache Spark)

> spark-class and related (spark-shell, etc.) no longer work with sbt build
> -
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class but everything depends on it) aren't yet ready for 
> this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Temporary workaround: revert SPARK-13579.
> Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build
> instructions in docs/building-spark.md if new build instructions are required).
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14424:


Assignee: Apache Spark

> spark-class and related (spark-shell, etc.) no longer work with sbt build
> -
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>Assignee: Apache Spark
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class but everything depends on it) aren't yet ready for 
> this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Temporary workaround: revert SPARK-13579.
> Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build
> instructions in docs/building-spark.md if new build instructions are required).
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227707#comment-15227707
 ] 

Apache Spark commented on SPARK-14424:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/12196

> spark-class and related (spark-shell, etc.) no longer work with sbt build
> -
>
> Key: SPARK-14424
> URL: https://issues.apache.org/jira/browse/SPARK-14424
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: holdenk
>
> SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
> (specifically spark-class but everything depends on it) aren't yet ready for 
> this change.
> Repro steps:
> ./build/sbt clean compile assembly && ./bin/spark-shell
> Temporary workaround: revert SPARK-13579.
> Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build
> instructions in docs/building-spark.md if new build instructions are required).
> Result:
> "
> Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
> You need to build Spark before running this program.
> "



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14424) spark-class and related (spark-shell, etc.) no longer work with sbt build

2016-04-05 Thread holdenk (JIRA)
holdenk created SPARK-14424:
---

 Summary: spark-class and related (spark-shell, etc.) no longer 
work with sbt build
 Key: SPARK-14424
 URL: https://issues.apache.org/jira/browse/SPARK-14424
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: holdenk


SPARK-13579 disabled building the assembly artifacts. Our shell scripts 
(specifically spark-class but everything depends on it) aren't yet ready for 
this change.
Repro steps:
./build/sbt clean compile assembly && ./bin/spark-shell

Temporary workaround: revert SPARK-13579.
Follow-up: redo SPARK-13579, fixing our shell scripts (or fix the build instructions
in docs/building-spark.md if new build instructions are required).
Result:
"
Failed to find Spark jars directory (.../assembly/target/scala-2.11/jars).
You need to build Spark before running this program.
"




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12741) DataFrame count method return wrong size.

2016-04-05 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227688#comment-15227688
 ] 

Stephane Maarek commented on SPARK-12741:
-

Hi,

May be related to:
http://stackoverflow.com/questions/36438898/spark-dataframe-count-doesnt-return-the-same-results-when-run-twice
 

I don't have code to generate the input file, it's just a simple hive table 
though.

Cheers,
Stephane

> DataFrame count method return wrong size.
> -
>
> Key: SPARK-12741
> URL: https://issues.apache.org/jira/browse/SPARK-12741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Sasi
>
> Hi,
> I'm updating my report.
> I'm working with Spark 1.5.2 (used to be 1.5.0). I have a DataFrame and
> 2 methods, one for collecting data and the other for counting.
> method doQuery looks like:
> {code}
> dataFrame.collect()
> {code}
> method doQueryCount looks like:
> {code}
> dataFrame.count()
> {code}
> I have few scenarios with few results:
> 1) No data exists in my NoSQL database: count() returns 0 and collect() returns 0.
> 2) 3 rows exist: count() returns 0 and collect() returns 3.
> 3) 5 rows exist: count() returns 2 and collect() returns 5.
> I tried to change the count code to the below code, but got the same results 
> as I mentioned above.
> {code}
> dataFrame.sql("select count(*) from tbl").count/collect[0]
> {code}
> Thanks,
> Sasi
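
For reference, the usual ways to express that count (a sketch using the names from the report above; {{tbl}} is the registered temp table):

{code:py}
# Count directly on the DataFrame.
n1 = dataFrame.count()

# Or count through SQL and pull the single value out of the first result row.
n2 = sqlContext.sql("SELECT COUNT(*) FROM tbl").collect()[0][0]
{code}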



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14249) Change MLReader.read to be a property for PySpark

2016-04-05 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227686#comment-15227686
 ] 

Miao Wang commented on SPARK-14249:
---

[~josephkb] I will take a look. Right now I am working on SPARK-14392 and am close
to finishing; I am adding the test case now. By the way, can you add me to the
whitelist? I have one PR for which the auto-test has not been triggered yet.

Thanks!

Miao

> Change MLReader.read to be a property for PySpark
> -
>
> Key: SPARK-14249
> URL: https://issues.apache.org/jira/browse/SPARK-14249
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> To match MLWritable.write and SQLContext.read, it would be good to make the
> PySpark MLReader classmethod {{read}} a property.
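
A minimal sketch of one way a read-only class-level property can be expressed in Python (a generic descriptor for illustration, not necessarily how Spark will implement it):

{code:py}
class classproperty(object):
    """Descriptor that exposes a zero-argument classmethod as a read-only property."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, obj, owner):
        # Invoked for both the class and its instances; dispatch on the class.
        return self.fget(owner)


class MLReadable(object):
    @classproperty
    def read(cls):
        # Hypothetical body: return a reader bound to this class.
        return "reader for %s" % cls.__name__


print(MLReadable.read)  # accessed like a property, without parentheses
{code}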



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14413) For data source tables, we should not allow users to set/change partition locations

2016-04-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-14413.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> For data source tables, we should not allow users to set/change partition 
> locations
> ---
>
> Key: SPARK-14413
> URL: https://issues.apache.org/jira/browse/SPARK-14413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-14129. For 
> data source tables, we rely on partition discovery. So, we do not allow
> users to set/change partition locations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14413) For data source tables, we should not allow users to set/change partition locations

2016-04-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227683#comment-15227683
 ] 

Yin Huai commented on SPARK-14413:
--

It has been resolved by https://github.com/apache/spark/pull/12186.

> For data source tables, we should not allow users to set/change partition 
> locations
> ---
>
> Key: SPARK-14413
> URL: https://issues.apache.org/jira/browse/SPARK-14413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-14129. For 
> data source tables, we rely on partition discovery. So, we do not allow
> users to set/change partition locations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14413) For data source tables, we should not allow users to set/change partition locations

2016-04-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-14413:
-
Summary: For data source tables, we should not allow users to set/change 
partition locations  (was: For data source tables, we should allow users to 
set/change partition locations)

> For data source tables, we should not allow users to set/change partition 
> locations
> ---
>
> Key: SPARK-14413
> URL: https://issues.apache.org/jira/browse/SPARK-14413
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
>
> This is a follow-up of https://issues.apache.org/jira/browse/SPARK-14129. For 
> data source tables, we rely on partition discovery. So, we do not allow
> users to set/change partition locations.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14296) whole stage codegen support for Dataset.map

2016-04-05 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-14296.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12087
[https://github.com/apache/spark/pull/12087]

> whole stage codegen support for Dataset.map 
> 
>
> Key: SPARK-14296
> URL: https://issues.apache.org/jira/browse/SPARK-14296
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2016-04-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227671#comment-15227671
 ] 

Takeshi Yamamuro commented on SPARK-13184:
--

Yeah, lowering `maxPartitionBytes` increases the number of partitions.
These options were recently merged into master, so you cannot use them in v1.6 or
earlier.
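
For illustration, on a master/2.0 build this would look roughly like the following (the option name spark.sql.files.maxPartitionBytes is the one in master at the time of writing; the path is made up):

{code:py}
# Rough sketch: lower the per-partition byte target so file-based sources
# (CSV, JSON, ...) split into more input partitions.
sqlContext.setConf("spark.sql.files.maxPartitionBytes", str(16 * 1024 * 1024))
df = sqlContext.read.json("/path/to/data.json")
print(df.rdd.getNumPartitions())
{code}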

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below at Spark CSV datasources,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but it looks like they
> need to shuffle the data in most cases.
> Although I am still not sure if it needs this, I will open this ticket just 
> for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache

2016-04-05 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227664#comment-15227664
 ] 

Saisai Shao commented on SPARK-14423:
-

I will fix it soon.

> Handle jar conflict issue when uploading to distributed cache
> -
>
> Key: SPARK-14423
> URL: https://issues.apache.org/jira/browse/SPARK-14423
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>
> With the introduction of assembly-free deployment of Spark, by default 
> yarn#client uploads all the jars in the assembly to the HDFS staging folder. 
> If a jar in the assembly and a jar specified with \--jars have the same name, 
> this causes an exception while downloading these jars and makes the 
> application fail to run.
> Here is the exception when running an example with {{run-example}}:
> {noformat}
> 16/04/06 10:29:48 INFO Client: Application report for 
> application_1459907402325_0004 (state: FAILED)
> 16/04/06 10:29:48 INFO Client:
>client token: N/A
>diagnostics: Application application_1459907402325_0004 failed 2 times 
> due to AM Container for appattempt_1459907402325_0004_02 exited with  
> exitCode: -1000
> For more detailed output, check application tracking 
> page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, 
> click on links to logs of each attempt.
> Diagnostics: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
> java.io.IOException: Resource 
> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
>  changed on src filesystem (expected 1459909780508, was 1459909782590
>   at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
>   at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
>   at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
>   at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The problem is that the jar {{avro-mapred-1.7.7-hadoop2.jar}} exists in both 
> the assembly and the examples folder.
> We should handle this situation, since the Spark examples currently fail to 
> run in YARN mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14423) Handle jar conflict issue when uploading to distributed cache

2016-04-05 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-14423:
---

 Summary: Handle jar conflict issue when uploading to distributed 
cache
 Key: SPARK-14423
 URL: https://issues.apache.org/jira/browse/SPARK-14423
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.0.0
Reporter: Saisai Shao


With the introduction of assembly-free deployment of Spark, by default 
yarn#client uploads all the jars in the assembly to the HDFS staging folder. 
If a jar in the assembly and a jar specified with \--jars have the same name, 
this causes an exception while downloading these jars and makes the 
application fail to run.

Here is the exception when running an example with {{run-example}}:

{noformat}
16/04/06 10:29:48 INFO Client: Application report for 
application_1459907402325_0004 (state: FAILED)
16/04/06 10:29:48 INFO Client:
 client token: N/A
 diagnostics: Application application_1459907402325_0004 failed 2 times 
due to AM Container for appattempt_1459907402325_0004_02 exited with  
exitCode: -1000
For more detailed output, check application tracking 
page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click 
on links to logs of each attempt.
Diagnostics: Resource 
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
 changed on src filesystem (expected 1459909780508, was 1459909782590
java.io.IOException: Resource 
hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar
 changed on src filesystem (expected 1459909780508, was 1459909782590
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The problem is that the jar {{avro-mapred-1.7.7-hadoop2.jar}} exists in both the 
assembly and the examples folder.

We should handle this situation, since the Spark examples currently fail to run 
in YARN mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14344) saveAsParquetFile creates _metadata file even when disabled

2016-04-05 Thread Kashish Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kashish Jain updated SPARK-14344:
-
Summary: saveAsParquetFile creates _metadata file even when disabled  (was: 
saveAsParquetFile creates _metafile even when disabled)

> saveAsParquetFile creates _metadata file even when disabled
> ---
>
> Key: SPARK-14344
> URL: https://issues.apache.org/jira/browse/SPARK-14344
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Kashish Jain
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Specifying the property "spark.hadoop.parquet.enable.summary-metadata false" 
> in the Spark properties file does not prevent the creation of the _metadata 
> file when calling rdd.saveAsParquetFile("TableName").
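
A hedged workaround sketch (my own, not verified against the reporter's environment), setting the underlying Hadoop/Parquet key directly on an existing SparkContext {{sc}} before writing an existing DataFrame {{df}}:

{code}
// parquet.enable.summary-metadata is the Hadoop configuration key that
// controls writing the _metadata/_common_metadata summary files.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// write.parquet is the current equivalent of the deprecated saveAsParquetFile.
df.write.parquet("/tmp/TableName")
{code}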



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2016-04-05 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227657#comment-15227657
 ] 

koert kuipers commented on SPARK-13184:
---

I am not familiar with those settings.
Are they respected by all data sources that are based on HadoopFsRelation (or 
whatever the successor of HadoopFsRelation is)?
By lowering maxPartitionBytes can I increase the number of partitions? And is 
this pushed down to the reading of the data, not a shuffle afterwards?
Thanks! Best, koert

> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests below at Spark CSV datasources,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but they appear to 
> shuffle the data in most cases.
> Although I am still not sure whether this is needed, I will open this ticket 
> just for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14300:


Assignee: (was: Apache Spark)

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning this code, be sure not to disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227655#comment-15227655
 ] 

Apache Spark commented on SPARK-14300:
--

User 'keypointt' has created a pull request for this issue:
https://github.com/apache/spark/pull/12195

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning this code, be sure not to disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14300:


Assignee: Apache Spark

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Assignee: Apache Spark
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning this code, be sure not to disturb the previous 
> example on and off blocks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14373) PySpark ml RandomForestClassifier, Regressor support export/import

2016-04-05 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227646#comment-15227646
 ] 

Kai Jiang commented on SPARK-14373:
---

I would work on this one.

> PySpark ml RandomForestClassifier, Regressor support export/import
> --
>
> Key: SPARK-14373
> URL: https://issues.apache.org/jira/browse/SPARK-14373
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14412) spark.ml ALS prefered storage level Params

2016-04-05 Thread Rishabh Bhardwaj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227639#comment-15227639
 ] 

Rishabh Bhardwaj commented on SPARK-14412:
--

I can take this up if no one has started on it.

> spark.ml ALS prefered storage level Params
> --
>
> Key: SPARK-14412
> URL: https://issues.apache.org/jira/browse/SPARK-14412
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> spark.mllib ALS supports {{setIntermediateRDDStorageLevel}} and 
> {{setFinalRDDStorageLevel}}.  Those should be added as Params in spark.ml, 
> but they should be in group "expertParam" since few users will need them.
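
For context, a small Scala sketch of the existing spark.mllib setters the ticket refers to; the ratings, rank and iteration counts are made up for illustration:

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 20, 3.0)))

val model = new ALS()
  .setRank(10)
  .setIterations(5)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)  // existing spark.mllib API
  .setFinalRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .run(ratings)
{code}

The proposal is to expose these same two knobs as expert-level Params on the spark.ml ALS estimator.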



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13211) StreamingContext throws NoSuchElementException when created from non-existent checkpoint directory

2016-04-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-13211.
--
   Resolution: Fixed
 Assignee: Sean Owen
Fix Version/s: 2.0.0

> StreamingContext throws NoSuchElementException when created from non-existent 
> checkpoint directory
> --
>
> Key: SPARK-13211
> URL: https://issues.apache.org/jira/browse/SPARK-13211
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> {code}
> scala> new StreamingContext("_checkpoint")
> 16/02/05 08:51:10 INFO Checkpoint: Checkpoint directory _checkpoint does not 
> exist
> java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:108)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:114)
>   ... 43 elided
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12469) Consistent Accumulators for Spark

2016-04-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12469:

Target Version/s: 2.0.0

> Consistent Accumulators for Spark
> -
>
> Key: SPARK-12469
> URL: https://issues.apache.org/jira/browse/SPARK-12469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: holdenk
>
> Tasks executed on Spark workers are unable to modify values from the driver, 
> and accumulators are the one exception for this. Accumulators in Spark are 
> implemented in such a way that when a stage is recomputed (say for cache 
> eviction) the accumulator will be updated a second time. This makes 
> accumulators inside of transformations more difficult to use for things like 
> counting invalid records (one of the primary potential use cases of 
> collecting side information during a transformation). However in some cases 
> this counting during re-evaluation is exactly the behaviour we want (say in 
> tracking total execution time for a particular function). Spark would benefit 
> from a version of accumulators which did not double count even if stages were 
> re-executed.
> Motivating example:
> {code}
> val parseTime = sc.accumulator(0L)
> val parseFailures = sc.accumulator(0L)
> val parsedData = sc.textFile(...).flatMap { line =>
>   val start = System.currentTimeMillis()
>   val parsed = Try(parse(line))
>   if (parsed.isFailure) parseFailures += 1
>   parseTime += System.currentTimeMillis() - start
>   parsed.toOption
> }
> parsedData.cache()
> val resultA = parsedData.map(...).filter(...).count()
> // some intervening code.  Almost anything could happen here -- some of 
> parsedData may
> // get kicked out of the cache, or an executor where data was cached might 
> get lost
> val resultB = parsedData.filter(...).map(...).flatMap(...).count()
> // now we look at the accumulators
> {code}
> Here we would want parseFailures to be incremented only once for every line 
> that failed to parse.  Unfortunately, the current Spark accumulator API 
> doesn't support the parseFailures use case, since if some data had been 
> evicted it's possible that it will be double counted.
> See the full design document at 
> https://docs.google.com/document/d/1lR_l1g3zMVctZXrcVjFusq2iQVpr4XvRK_UUDsDr6nk/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14422) Improve handling of optional configs in SQLConf

2016-04-05 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-14422:
--

 Summary: Improve handling of optional configs in SQLConf
 Key: SPARK-14422
 URL: https://issues.apache.org/jira/browse/SPARK-14422
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin
Priority: Minor


As Michael showed here: 
https://github.com/apache/spark/pull/12119/files/69aa1a005cc7003ab62d6dfcdef42181b053eaed#r58634150

Handling of optional configs in SQLConf is a little sub-optimal right now. We 
should clean that up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14359) Improve user experience for typed aggregate functions in Java

2016-04-05 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14359.
-
   Resolution: Fixed
 Assignee: Eric Liang
Fix Version/s: 2.0.0

> Improve user experience for typed aggregate functions in Java
> -
>
> Key: SPARK-14359
> URL: https://issues.apache.org/jira/browse/SPARK-14359
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Eric Liang
> Fix For: 2.0.0
>
>
> See the Scala version in SPARK-14285. The main problem we'd need to work 
> around is that Java cannot return primitive types in generics, and as a 
> result we would have to return boxed types.
> One requirement is that we should add tests for both Java 7 style (in 
> sql/core) and Java 8 style lambdas (in external/java-8...).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227551#comment-15227551
 ] 

Takeshi Yamamuro edited comment on SPARK-14393 at 4/6/16 2:08 AM:
--

It seems that different `MonotonicallyIncreasingID` instances share the same stage, 
that is, the same partition id. However, I'm not 100% sure this is wrong 
behaviour, because we can easily break the monotonically-increasing semantics 
by using repartition/coalesce after select(monotonicallyIncreasingId()).


was (Author: maropu):
It seems that different `MonotonicallyIncreasingID` instances wrongly share the same 
stage, that is, the same partition id.

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14400) ScriptTransformation does not fail the job for bad user command

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14400:


Assignee: Apache Spark

> ScriptTransformation does not fail the job for bad user command
> ---
>
> Key: SPARK-14400
> URL: https://issues.apache.org/jira/browse/SPARK-14400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Minor
>
> If the `script` to be run is an incorrect command, Spark does not catch the 
> failure in running the sub-process and the job is marked as successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14400) ScriptTransformation does not fail the job for bad user command

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14400:


Assignee: (was: Apache Spark)

> ScriptTransformation does not fail the job for bad user command
> ---
>
> Key: SPARK-14400
> URL: https://issues.apache.org/jira/browse/SPARK-14400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> If the `script` to be run is an incorrect command, Spark does not catch the 
> failure in running the sub-process and the job is marked as successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14400) ScriptTransformation does not fail the job for bad user command

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227566#comment-15227566
 ] 

Apache Spark commented on SPARK-14400:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/12194

> ScriptTransformation does not fail the job for bad user command
> ---
>
> Key: SPARK-14400
> URL: https://issues.apache.org/jira/browse/SPARK-14400
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Tejas Patil
>Priority: Minor
>
> If the `script` to be run is an incorrect command, Spark does not catch the 
> failure in running the sub-process and the job is marked as successful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14252) Executors do not try to download remote cached blocks

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227562#comment-15227562
 ] 

Apache Spark commented on SPARK-14252:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/12193

> Executors do not try to download remote cached blocks
> -
>
> Key: SPARK-14252
> URL: https://issues.apache.org/jira/browse/SPARK-14252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 
> includes SPARK-12817, which changed the caching code a bit to remove 
> duplication. But it seems to have removed the part where executors check 
> whether other executors contain the needed cached block.
> In 1.6, that was done by the call to {{BlockManager.get}} in 
> {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls 
> {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and 
> thus the executor never gets blocks that are cached by other executors, 
> causing the blocks to be recomputed locally instead.
> I wrote a small program that shows this. In 1.6, running with 
> {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages 
> like these in the logs:
> {noformat}
> 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered 
> locally
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 
> from BlockManagerId(1, blah, 58831)
> 1
> {noformat}
> On 2.0, I get (almost) all the 10 partitions cached on *both* executors, 
> because once the second one fails to find a block locally it just recomputes 
> it and caches it. It never tries to download the block from the other 
> executor.  The log messages above, which still exist in the code, don't show 
> up anywhere.
> Here's the code I used for the above (trimmed of some other stuff from my 
> little test harness, so might not compile as is):
> {code}
> val sc = new SparkContext(conf)
> try {
>   val rdd = sc.parallelize(1 to 1000, 10)
>   rdd.cache()
>   rdd.count()
>   // Create a single task that will sleep and block, so that a particular 
> executor is busy.
>   // This should force future tasks to download cached data from that 
> executor.
>   println("Running sleep job..")
>   val thread =  new Thread(new Runnable() {
> override def run(): Unit = {
>   rdd.mapPartitionsWithIndex { (i, iter) =>
> if (i == 0) {
>   Thread.sleep(TimeUnit.MINUTES.toMillis(10))
> }
> iter
>   }.count()
> }
>   })
>   thread.setDaemon(true)
>   thread.start()
>   // Wait a few seconds to make sure the task is running (too lazy for 
> listeners)
>   println("Waiting for tasks to start...")
>   TimeUnit.SECONDS.sleep(10)
>   // Now run a job that will touch everything and should use the cached 
> data.
>   val cnt = rdd.map(_*2).count()
>   println(s"Counted $cnt elements.")
>   println("Killing sleep job.")
>   thread.interrupt()
>   thread.join()
> } finally {
>   sc.stop()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14252) Executors do not try to download remote cached blocks

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14252:


Assignee: Apache Spark

> Executors do not try to download remote cached blocks
> -
>
> Key: SPARK-14252
> URL: https://issues.apache.org/jira/browse/SPARK-14252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 
> includes SPARK-12817, which changed the caching code a bit to remove 
> duplication. But it seems to have removed the part where executors check 
> whether other executors contain the needed cached block.
> In 1.6, that was done by the call to {{BlockManager.get}} in 
> {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls 
> {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and 
> thus the executor never gets blocks that are cached by other executors, 
> causing the blocks to be recomputed locally instead.
> I wrote a small program that shows this. In 1.6, running with 
> {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages 
> like these in the logs:
> {noformat}
> 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered 
> locally
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 
> from BlockManagerId(1, blah, 58831)
> 1
> {noformat}
> On 2.0, I get (almost) all the 10 partitions cached on *both* executors, 
> because once the second one fails to find a block locally it just recomputes 
> it and caches it. It never tries to download the block from the other 
> executor.  The log messages above, which still exist in the code, don't show 
> up anywhere.
> Here's the code I used for the above (trimmed of some other stuff from my 
> little test harness, so might not compile as is):
> {code}
> val sc = new SparkContext(conf)
> try {
>   val rdd = sc.parallelize(1 to 1000, 10)
>   rdd.cache()
>   rdd.count()
>   // Create a single task that will sleep and block, so that a particular 
> executor is busy.
>   // This should force future tasks to download cached data from that 
> executor.
>   println("Running sleep job..")
>   val thread =  new Thread(new Runnable() {
> override def run(): Unit = {
>   rdd.mapPartitionsWithIndex { (i, iter) =>
> if (i == 0) {
>   Thread.sleep(TimeUnit.MINUTES.toMillis(10))
> }
> iter
>   }.count()
> }
>   })
>   thread.setDaemon(true)
>   thread.start()
>   // Wait a few seconds to make sure the task is running (too lazy for 
> listeners)
>   println("Waiting for tasks to start...")
>   TimeUnit.SECONDS.sleep(10)
>   // Now run a job that will touch everything and should use the cached 
> data.
>   val cnt = rdd.map(_*2).count()
>   println(s"Counted $cnt elements.")
>   println("Killing sleep job.")
>   thread.interrupt()
>   thread.join()
> } finally {
>   sc.stop()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14252) Executors do not try to download remote cached blocks

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14252:


Assignee: (was: Apache Spark)

> Executors do not try to download remote cached blocks
> -
>
> Key: SPARK-14252
> URL: https://issues.apache.org/jira/browse/SPARK-14252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 
> includes SPARK-12817, which changed the caching code a bit to remove 
> duplication. But it seems to have removed the part where executors check 
> whether other executors contain the needed cached block.
> In 1.6, that was done by the call to {{BlockManager.get}} in 
> {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls 
> {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and 
> thus the executor never gets blocks that are cached by other executors, 
> causing the blocks to be recomputed locally instead.
> I wrote a small program that shows this. In 1.6, running with 
> {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages 
> like these in the logs:
> {noformat}
> 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered 
> locally
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 
> from BlockManagerId(1, blah, 58831)
> 1
> {noformat}
> On 2.0, I get (almost) all the 10 partitions cached on *both* executors, 
> because once the second one fails to find a block locally it just recomputes 
> it and caches it. It never tries to download the block from the other 
> executor.  The log messages above, which still exist in the code, don't show 
> up anywhere.
> Here's the code I used for the above (trimmed of some other stuff from my 
> little test harness, so might not compile as is):
> {code}
> val sc = new SparkContext(conf)
> try {
>   val rdd = sc.parallelize(1 to 1000, 10)
>   rdd.cache()
>   rdd.count()
>   // Create a single task that will sleep and block, so that a particular 
> executor is busy.
>   // This should force future tasks to download cached data from that 
> executor.
>   println("Running sleep job..")
>   val thread =  new Thread(new Runnable() {
> override def run(): Unit = {
>   rdd.mapPartitionsWithIndex { (i, iter) =>
> if (i == 0) {
>   Thread.sleep(TimeUnit.MINUTES.toMillis(10))
> }
> iter
>   }.count()
> }
>   })
>   thread.setDaemon(true)
>   thread.start()
>   // Wait a few seconds to make sure the task is running (too lazy for 
> listeners)
>   println("Waiting for tasks to start...")
>   TimeUnit.SECONDS.sleep(10)
>   // Now run a job that will touch everything and should use the cached 
> data.
>   val cnt = rdd.map(_*2).count()
>   println(s"Counted $cnt elements.")
>   println("Killing sleep job.")
>   thread.interrupt()
>   thread.join()
> } finally {
>   sc.stop()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14393) monotonicallyIncreasingId not monotonically increasing with downstream coalesce

2016-04-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227551#comment-15227551
 ] 

Takeshi Yamamuro commented on SPARK-14393:
--

It seems that different `MonotonicallyIncreasingID` instances wrongly share the same 
stage, that is, the same partition id.

> monotonicallyIncreasingId not monotonically increasing with downstream 
> coalesce
> ---
>
> Key: SPARK-14393
> URL: https://issues.apache.org/jira/browse/SPARK-14393
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Jason Piper
>
> When utilising monotonicallyIncreasingId with a coalesce, it appears that 
> every partition uses the same offset (0) leading to non-monotonically 
> increasing IDs.
> See examples below
> {code}
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |25769803776|
> |51539607552|
> |77309411328|
> |   103079215104|
> |   128849018880|
> |   163208757248|
> |   188978561024|
> |   214748364800|
> |   240518168576|
> |   266287972352|
> +---+
> >>> sqlContext.range(10).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> |  0|
> +---+
> >>> sqlContext.range(10).repartition(5).select(monotonicallyIncreasingId()).coalesce(1).show()
> +---+
> |monotonicallyincreasingid()|
> +---+
> |  0|
> |  1|
> |  0|
> |  0|
> |  1|
> |  2|
> |  3|
> |  0|
> |  1|
> |  2|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14402:


Assignee: Apache Spark  (was: Dongjoon Hyun)

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected behaviour of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.
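
A minimal illustration of the difference described above (the outputs in the comments are inferred from the description, not from running the patch):

{code}
sqlContext.sql("SELECT initcap('sParK')").show()
// current toTitleCase-based behaviour:      SParK
// Hive/Oracle-compatible behaviour wanted:  Spark
{code}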



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14402:


Assignee: Dongjoon Hyun  (was: Apache Spark)

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected behaviour of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski reopened SPARK-14402:
-

There's a compilation error after the change.

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected behaviour of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14402) initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227536#comment-15227536
 ] 

Apache Spark commented on SPARK-14402:
--

User 'jaceklaskowski' has created a pull request for this issue:
https://github.com/apache/spark/pull/12192

> initcap UDF doesn't match Hive/Oracle behavior in lowercasing rest of string
> 
>
> Key: SPARK-14402
> URL: https://issues.apache.org/jira/browse/SPARK-14402
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.0.0
>
>
> Currently, SparkSQL `initCap` uses the `toTitleCase` function. However, the 
> `UTF8String.toTitleCase` implementation changes only the first letter and 
> just copies the other letters: e.g. *sParK* --> *SParK*. (This is the correct 
> expected behaviour of `toTitleCase`.)
> So, the main goal of this issue is to provide the correct `initCap`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-05 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227498#comment-15227498
 ] 

Sun Rui commented on SPARK-12922:
-

cool:)

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame of the 
> grouped data
> R function output: a local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> could do map-side combination via dapply().
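
For comparison, a hedged Scala sketch of the Dataset-side analogue mentioned above (flatMapGroups); the case class and aggregation are invented for illustration:

{code}
import sqlContext.implicits._

case class Sale(region: String, amount: Double)

val ds = sqlContext.createDataset(Seq(Sale("east", 1.0), Sale("east", 2.0), Sale("west", 5.0)))

// Group by a key function and apply a function per group, analogous to
// gapply(gd, function(grouping_key, group) {...}, schema) in SparkR.
// (groupBy on Dataset in 1.6; renamed groupByKey in later versions.)
val totals = ds.groupBy(_.region).flatMapGroups { (region, sales) =>
  Iterator((region, sales.map(_.amount).sum))
}
totals.show()
{code}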



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14421) Kinesis deaggregation with PySpark

2016-04-05 Thread Brian ONeill (JIRA)
Brian ONeill created SPARK-14421:


 Summary: Kinesis deaggregation with PySpark
 Key: SPARK-14421
 URL: https://issues.apache.org/jira/browse/SPARK-14421
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.1
 Environment: PySpark w/ Kinesis word count example
Reporter: Brian ONeill


I'm creating this issue as a precaution...

We have some preliminary evidence that indicates that KPL de-aggregation for 
Kinesis streams may not work in Spark 1.6.1.  Using the PySpark Kinesis Word 
Count example, we don't receive records when KPL is used to produce the data, 
with aggregation turned on, using masterUrl = local[16].

At the same time, I noticed this thread:
https://forums.aws.amazon.com/message.jspa?messageID=707122

Following the instructions here:
http://brianoneill.blogspot.com/2016/03/pyspark-on-amazon-emr-w-kinesis.html

The example will sometimes work.   When aggregation is disabled, it appears to 
always work.  I'm going to dig a bit deeper, but thought you might have some 
pointers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14252) Executors do not try to download remote cached blocks

2016-04-05 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227493#comment-15227493
 ] 

Eric Liang commented on SPARK-14252:


I'm going to take a look at fixing this

> Executors do not try to download remote cached blocks
> -
>
> Key: SPARK-14252
> URL: https://issues.apache.org/jira/browse/SPARK-14252
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>
> I noticed this when taking a look at the root cause of SPARK-14209. 2.0.0 
> includes SPARK-12817, which changed the caching code a bit to remove 
> duplication. But it seems to have removed the part where executors check 
> whether other executors contain the needed cached block.
> In 1.6, that was done by the call to {{BlockManager.get}} in 
> {{CacheManager.getOrCompute}}. But in the new version, {{RDD.iterator}} calls 
> {{BlockManager.getOrElseUpdate}}, which never calls {{BlockManager.get}}, and 
> thus the executor never gets blocks that are cached by other executors, 
> causing the blocks to be recomputed locally instead.
> I wrote a small program that shows this. In 1.6, running with 
> {{--num-executors 2}}, I get 5 blocks cached on each executor, and messages 
> like these in the logs:
> {noformat}
> 16/03/29 13:18:01 DEBUG spark.CacheManager: Looking for partition rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting local block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Block rdd_0_7 not registered 
> locally
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7
> 16/03/29 13:18:01 DEBUG storage.BlockManager: Getting remote block rdd_0_7 
> from BlockManagerId(1, blah, 58831)
> 1
> {noformat}
> On 2.0, I get (almost) all the 10 partitions cached on *both* executors, 
> because once the second one fails to find a block locally it just recomputes 
> it and caches it. It never tries to download the block from the other 
> executor.  The log messages above, which still exist in the code, don't show 
> up anywhere.
> Here's the code I used for the above (trimmed of some other stuff from my 
> little test harness, so might not compile as is):
> {code}
> val sc = new SparkContext(conf)
> try {
>   val rdd = sc.parallelize(1 to 1000, 10)
>   rdd.cache()
>   rdd.count()
>   // Create a single task that will sleep and block, so that a particular 
> executor is busy.
>   // This should force future tasks to download cached data from that 
> executor.
>   println("Running sleep job..")
>   val thread =  new Thread(new Runnable() {
> override def run(): Unit = {
>   rdd.mapPartitionsWithIndex { (i, iter) =>
> if (i == 0) {
>   Thread.sleep(TimeUnit.MINUTES.toMillis(10))
> }
> iter
>   }.count()
> }
>   })
>   thread.setDaemon(true)
>   thread.start()
>   // Wait a few seconds to make sure the task is running (too lazy for 
> listeners)
>   println("Waiting for tasks to start...")
>   TimeUnit.SECONDS.sleep(10)
>   // Now run a job that will touch everything and should use the cached 
> data.
>   val cnt = rdd.map(_*2).count()
>   println(s"Counted $cnt elements.")
>   println("Killing sleep job.")
>   thread.interrupt()
>   thread.join()
> } finally {
>   sc.stop()
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13687) Cleanup pyspark temporary files

2016-04-05 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227482#comment-15227482
 ] 

holdenk commented on SPARK-13687:
-

I'll take this one :)

> Cleanup pyspark temporary files
> ---
>
> Key: SPARK-13687
> URL: https://issues.apache.org/jira/browse/SPARK-13687
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Damir
>
> Every time parallelize is called, it creates a temporary file for the RDD in 
> the spark.local.dir/spark-uuid/pyspark-uuid/ directory. This directory is 
> deleted when the context is closed, but for long-running applications with a 
> permanently open context this directory grows without bound.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14420) keepLastCheckpoint Param for Python LDA with EM

2016-04-05 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14420:
-

 Summary: keepLastCheckpoint Param for Python LDA with EM
 Key: SPARK-14420
 URL: https://issues.apache.org/jira/browse/SPARK-14420
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Priority: Minor


See the linked JIRA for the Scala API.  This task can add it in spark.ml.  Adding 
it to spark.mllib is optional IMO.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14398:


Assignee: Apache Spark

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> We need to check whether all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join {{LEFT}}, {{RIGHT}} & {{FULL}} keywords; these used to be non-reserved and 
> are now reserved.
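
As a concrete illustration (my own, not from the ticket) of what non-reserved means here: a non-reserved keyword can still be used as an ordinary identifier, for example as a column alias:

{code}
// Parses only if LEFT is non-reserved; with LEFT reserved, the parser is
// expected to reject the alias.
sqlContext.sql("SELECT 1 AS left").show()
{code}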



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227451#comment-15227451
 ] 

Apache Spark commented on SPARK-14398:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12191

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> We need to check whether all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join {{LEFT}}, {{RIGHT}} & {{FULL}} keywords; these used to be non-reserved and 
> are now reserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14398) Audit non-reserved keyword list in ANTLR4 parser.

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14398:


Assignee: (was: Apache Spark)

> Audit non-reserved keyword list in ANTLR4 parser.
> -
>
> Key: SPARK-14398
> URL: https://issues.apache.org/jira/browse/SPARK-14398
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
> Fix For: 2.0.0
>
>
> We need to check whether all keywords that were non-reserved in the `old` ANTLR3 
> parser are still non-reserved in the ANTLR4 parser. Notable exceptions are the 
> join {{LEFT}}, {{RIGHT}} & {{FULL}} keywords; these used to be non-reserved and 
> are now reserved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4591) Algorithm/model parity in spark.ml (Scala)

2016-04-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Description: 
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve feature parity 
for the next release.

Subtasks cover major algorithm groups.  To pick up a review subtask, please:
* Comment that you are working on it.
* Compare the public APIs of spark.ml vs. spark.mllib.
* Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
* Check for existing JIRAs covering those items.  If there is no existing JIRA, 
create one, and link it to your comment.

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* single-Row prediction: [SPARK-10413]

Also, this does not include the following items (but will eventually):
* User-facing:
** Streaming ML
** evaluation
** fpm
** pmml
** stat
** linalg [SPARK-13944]
* Developer-facing:
** optimization
** random, rdd
** util


  was:
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve feature parity 
for the next release.

Subtasks cover major algorithm groups.  To pick up a review subtask, please:
* Comment that you are working on it.
* Compare the public APIs of spark.ml vs. spark.mllib.
* Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
* Check for existing JIRAs covering those items.  If there is no existing JIRA, 
create one, and link it to your comment.

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* single-Row prediction: [SPARK-10413]

Also, this does not include the following items (but will eventually):
* User-facing:
** Streaming ML (to be done under structured streaming in the 2.x line)
** evaluation
** fpm
** pmml
** stat
** linalg [SPARK-13944]
* Developer-facing:
** optimization
** random, rdd
** util



> Algorithm/model parity in spark.ml (Scala)
> --
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Subtasks cover major algorithm groups.  To pick up a review subtask, please:
> * Comment that you are working on it.
> * Compare the public APIs of spark.ml vs. spark.mllib.
> * Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> * Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * single-Row prediction: [SPARK-10413]
> Also, this does not include the following items (but will eventually):
> * User-facing:
> ** Streaming ML
> ** evaluation
> ** fpm
> ** pmml
> ** stat
> ** linalg [SPARK-13944]
> * Developer-facing:
> ** optimization
> ** random, rdd
> ** util



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4591) Algorithm/model parity in spark.ml (Scala)

2016-04-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Description: 
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve feature parity 
for the next release.

Subtasks cover major algorithm groups.  To pick up a review subtask, please:
* Comment that you are working on it.
* Compare the public APIs of spark.ml vs. spark.mllib.
* Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
* Check for existing JIRAs covering those items.  If there is no existing JIRA, 
create one, and link it to your comment.

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* single-Row prediction: [SPARK-10413]

Also, this does not include the following items (but will eventually):
* User-facing:
** Streaming ML (to be done under structured streaming in the 2.x line)
** evaluation
** fpm
** pmml
** stat
** linalg [SPARK-13944]
* Developer-facing:
** optimization
** random, rdd
** util


  was:
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve feature parity 
for the next release.

Subtasks cover major algorithm groups.  To pick up a review subtask, please:
* Comment that you are working on it.
* Compare the public APIs of spark.ml vs. spark.mllib.
* Comment on all missing items within spark.ml: algorithms, models, methods, 
features, etc.
* Check for existing JIRAs covering those items.  If there is no existing JIRA, 
create one, and link it to your comment.

This does *not* include:
* Python: We can compare Scala vs. Python in spark.ml itself.
* single-Row prediction: [SPARK-10413]

Also, this does not include the following items:
* User-facing:
** Streaming ML (to be done under structured streaming in the 2.x line)
** evaluation
** fpm
** pmml
** stat
** linalg [SPARK-13944]
* Developer-facing:
** optimization
** random, rdd
** util



> Algorithm/model parity in spark.ml (Scala)
> --
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Subtasks cover major algorithm groups.  To pick up a review subtask, please:
> * Comment that you are working on it.
> * Compare the public APIs of spark.ml vs. spark.mllib.
> * Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> * Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * single-Row prediction: [SPARK-10413]
> Also, this does not include the following items (but will eventually):
> * User-facing:
> ** Streaming ML (to be done under structured streaming in the 2.x line)
> ** evaluation
> ** fpm
> ** pmml
> ** stat
> ** linalg [SPARK-13944]
> * Developer-facing:
> ** optimization
> ** random, rdd
> ** util



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity in spark.ml (Scala)

2016-04-05 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227445#comment-15227445
 ] 

Joseph K. Bradley commented on SPARK-4591:
--

We will; eventually we should support everything.  I just noted the 
highest-priority items first.

> Algorithm/model parity in spark.ml (Scala)
> --
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Subtasks cover major algorithm groups.  To pick up a review subtask, please:
> * Comment that you are working on it.
> * Compare the public APIs of spark.ml vs. spark.mllib.
> * Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> * Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * single-Row prediction: [SPARK-10413]
> Also, this does not include the following items:
> * User-facing:
> ** Streaming ML (to be done under structured streaming in the 2.x line)
> ** evaluation
> ** fpm
> ** pmml
> ** stat
> ** linalg [SPARK-13944]
> * Developer-facing:
> ** optimization
> ** random, rdd
> ** util



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13184) Support minPartitions parameter for JSON and CSV datasources as options

2016-04-05 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227435#comment-15227435
 ] 

Takeshi Yamamuro commented on SPARK-13184:
--

It seems we can already handle this with "spark.sql.files.maxPartitionBytes" and 
"spark.sql.files.openCostInBytes"; see the sketch below.
I'm not sure we need a new option for this.
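
A minimal PySpark sketch (values and path are illustrative; it assumes Spark 2.0 
with a SparkSession named {{spark}}):

{code}
# A smaller maxPartitionBytes yields more, smaller input partitions when reading
# files; openCostInBytes adds a per-file cost so many tiny files get packed together.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(32 * 1024 * 1024))  # 32 MB splits
spark.conf.set("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))     # 4 MB per file

df = spark.read.json("/path/to/input.json")
print(df.rdd.getNumPartitions())
{code}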


> Support minPartitions parameter for JSON and CSV datasources as options
> ---
>
> Key: SPARK-13184
> URL: https://issues.apache.org/jira/browse/SPARK-13184
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> After looking through the pull requests and issues below for the Spark CSV datasource,
> https://github.com/databricks/spark-csv/pull/256
> https://github.com/databricks/spark-csv/issues/141
> https://github.com/databricks/spark-csv/pull/186
> It looks like Spark might need to be able to set {{minPartitions}}.
> {{repartition()}} or {{coalesce()}} can be alternatives, but it looks like they 
> need to shuffle the data in most cases.
> Although I am still not sure whether this is needed, I am opening this ticket 
> just for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14119) Role management commands (Exception)

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227421#comment-15227421
 ] 

Apache Spark commented on SPARK-14119:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12190

> Role management commands (Exception)
> 
>
> Key: SPARK-14119
> URL: https://issues.apache.org/jira/browse/SPARK-14119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> All role management commands should throw exceptions.
> TOK_CREATEROLE (Exception)
> TOK_DROPROLE (Exception)
> TOK_GRANT (Exception)
> TOK_GRANT_ROLE (Exception)
> TOK_SHOW_GRANT (Exception)
> TOK_SHOW_ROLE_GRANT (Exception)
> TOK_SHOW_ROLE_PRINCIPALS (Exception)
> TOK_SHOW_ROLES (Exception)
> TOK_SHOW_SET_ROLE (Exception)
> TOK_REVOKE (Exception)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14419) Improve the HashedRelation for key fit within Long

2016-04-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14419:
--

 Summary: Improve the HashedRelation for key fit within Long
 Key: SPARK-14419
 URL: https://issues.apache.org/jira/browse/SPARK-14419
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


1. Manage the memory by MemoryManager
2. Improve the memory efficiency



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13842) Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType

2016-04-05 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227392#comment-15227392
 ] 

holdenk commented on SPARK-13842:
-

cc [~davies] any thoughts?

> Consider __iter__ and __getitem__ methods for pyspark.sql.types.StructType
> --
>
> Key: SPARK-13842
> URL: https://issues.apache.org/jira/browse/SPARK-13842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.6.1
>Reporter: Shea Parkes
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> It would be nice to consider adding \_\_iter\_\_ and \_\_getitem\_\_ to 
> {{pyspark.sql.types.StructType}}.  Here are some simplistic suggestions:
> {code}
> def __iter__(self):
>     """Iterate the fields upon request."""
>     return iter(self.fields)
>
> def __getitem__(self, key):
>     """Return the corresponding StructField"""
>     _fields_dict = dict(zip(self.names, self.fields))
>     try:
>         return _fields_dict[key]
>     except KeyError:
>         raise KeyError('No field named {}'.format(key))
> {code}
> I realize the latter might be a touch more controversial since there could be 
> name collisions.  Still, I doubt there are that many in practice and it would 
> be quite nice to work with.
> Privately, I have more extensive metadata extraction methods overlaid on this 
> class, but I imagine the rest of what I have done might go too far for the 
> common user.  If this request gains traction though, I'll share those other 
> layers.
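
For context, a hypothetical sketch of how the proposed methods might be used if 
they were added (field names are illustrative):

{code}
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([StructField("id", LongType()), StructField("name", StringType())])

# would rely on the proposed __iter__
for field in schema:
    print("{}: {}".format(field.name, field.dataType))

# would rely on the proposed __getitem__ (raises KeyError for unknown names)
name_field = schema["name"]
{code}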



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-04-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-3821.
---
Resolution: Won't Fix

I'm resolving this as "Won't Fix" due to lack of interest, both on my part and 
on the part of the Spark / spark-ec2 project maintainers.

If anyone's interested in picking this up, the code is here: 
https://github.com/nchammas/spark-ec2/tree/packer/image-build

I've mostly moved on from spark-ec2 to work on 
[Flintrock|https://github.com/nchammas/flintrock], which doesn't require custom 
AMIs.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14418) Broadcast.unpersist() in PySpark is not consistent with that in Scala

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227382#comment-15227382
 ] 

Apache Spark commented on SPARK-14418:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12189

> Broadcast.unpersist() in PySpark is not consistent with that in Scala
> -
>
> Key: SPARK-14418
> URL: https://issues.apache.org/jira/browse/SPARK-14418
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Broadcast.unpersist() removes the file; that should be the behavior of 
> destroy().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14418) Broadcast.unpersist() in PySpark is not consistent with that in Scala

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14418:


Assignee: Apache Spark  (was: Davies Liu)

> Broadcast.unpersist() in PySpark is not consistent with that in Scala
> -
>
> Key: SPARK-14418
> URL: https://issues.apache.org/jira/browse/SPARK-14418
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Broadcast.unpersist() removes the file; that should be the behavior of 
> destroy().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14418) Broadcast.unpersist() in PySpark is not consistent with that in Scala

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14418:


Assignee: Davies Liu  (was: Apache Spark)

> Broadcast.unpersist() in PySpark is not consistent with that in Scala
> -
>
> Key: SPARK-14418
> URL: https://issues.apache.org/jira/browse/SPARK-14418
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Broadcast.unpersist() removes the file; that should be the behavior of 
> destroy().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14418) Broadcast.unpersist() in PySpark is not consistent with that in Scala

2016-04-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14418:
--

 Summary: Broadcast.unpersist() in PySpark is not consistent with 
that in Scala
 Key: SPARK-14418
 URL: https://issues.apache.org/jira/browse/SPARK-14418
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu


Broadcast.unpersist() removes the file; that should be the behavior of 
destroy().
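
For reference, a sketch of the Scala-consistent semantics this change is after 
(illustrative PySpark; it assumes a {{destroy()}} counterpart mirroring the Scala 
API and a SparkContext named {{sc}}):

{code}
bc = sc.broadcast(list(range(1000)))

bc.unpersist()       # should only drop cached copies on the executors;
                     # the variable stays usable and is re-broadcast on next use
print(bc.value[:3])  # should still work after unpersist()

bc.destroy()         # should release all resources, including the backing file;
                     # the broadcast can no longer be used afterwards
{code}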



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14392) CountVectorizer Estimator should include binary toggle Param

2016-04-05 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227371#comment-15227371
 ] 

Miao Wang commented on SPARK-14392:
---

I moved the code from CountVectorizerModel to CountVectorizerParams, and all 
current tests pass. Now I am designing test cases for CountVectorizer with 
binary set to true; a sketch of the kind of check is below.
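
A hypothetical sketch of such a check (PySpark for brevity; it assumes the 
Estimator gains the {{binary}} Param proposed here and a SparkSession named 
{{spark}}):

{code}
from pyspark.ml.feature import CountVectorizer

df = spark.createDataFrame([(0, ["a", "a", "b"])], ["id", "words"])
# binary=True is the proposed Estimator-level Param
cv = CountVectorizer(inputCol="words", outputCol="features", binary=True)
model = cv.fit(df)
features = model.transform(df).head().features
# with binary=True, every non-zero term count should be capped at 1.0
assert set(features.values.tolist()) == {1.0}
{code}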

> CountVectorizer Estimator should include binary toggle Param
> 
>
> Key: SPARK-14392
> URL: https://issues.apache.org/jira/browse/SPARK-14392
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> CountVectorizerModel contains a "binary" toggle Param.  The Estimator should 
> contain it as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14417) Cleanup Scala deprecation warnings once we drop 2.10.X

2016-04-05 Thread holdenk (JIRA)
holdenk created SPARK-14417:
---

 Summary: Cleanup Scala deprecation warnings once we drop 2.10.X
 Key: SPARK-14417
 URL: https://issues.apache.org/jira/browse/SPARK-14417
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: holdenk
Priority: Minor


A previous issue addressed many of the deprecation warnings, but because we 
didn't want to introduce Scala-version-specific code, a number of deprecation 
warnings could not easily be fixed. Once we drop Scala 2.10 we should go back 
and clean up these remaining issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14416) Add thread-safe comments for CoarseGrainedSchedulerBackend's fields

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14416:


Assignee: Apache Spark

> Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
> ---
>
> Key: SPARK-14416
> URL: https://issues.apache.org/jira/browse/SPARK-14416
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Minor
>
> While I was reviewing https://github.com/apache/spark/pull/12078, I found that 
> most of CoarseGrainedSchedulerBackend's mutable fields don't have any comments 
> about their thread-safety assumptions, so it's hard for people to figure out 
> which parts of the code should be protected by the lock. 
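
For illustration, a generic (non-Spark) toy sketch of the kind of guard comments 
meant here, written in Python for consistency with the other sketches in this 
digest; the actual Scala fix may use a different convention:

{code}
import threading

class SchedulerBackendLike(object):
    """Toy example: each mutable field documents the lock (or thread) that guards it."""

    def __init__(self):
        self._lock = threading.Lock()
        # guarded by self._lock
        self.executor_data = {}
        # guarded by self._lock
        self.total_core_count = 0
        # accessed only from the event-loop thread; no lock needed
        self.pending_removals = set()

    def register_executor(self, executor_id, cores):
        with self._lock:
            self.executor_data[executor_id] = cores
            self.total_core_count += cores
{code}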



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14416) Add thread-safe comments for CoarseGrainedSchedulerBackend's fields

2016-04-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227354#comment-15227354
 ] 

Apache Spark commented on SPARK-14416:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/12188

> Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
> ---
>
> Key: SPARK-14416
> URL: https://issues.apache.org/jira/browse/SPARK-14416
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Priority: Minor
>
> While I was reviewing https://github.com/apache/spark/pull/12078, I found that 
> most of CoarseGrainedSchedulerBackend's mutable fields don't have any comments 
> about their thread-safety assumptions, so it's hard for people to figure out 
> which parts of the code should be protected by the lock. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14416) Add thread-safe comments for CoarseGrainedSchedulerBackend's fields

2016-04-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14416:


Assignee: (was: Apache Spark)

> Add thread-safe comments for CoarseGrainedSchedulerBackend's fields
> ---
>
> Key: SPARK-14416
> URL: https://issues.apache.org/jira/browse/SPARK-14416
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Priority: Minor
>
> While I was reviewing https://github.com/apache/spark/pull/12078, I found that 
> most of CoarseGrainedSchedulerBackend's mutable fields don't have any comments 
> about their thread-safety assumptions, so it's hard for people to figure out 
> which parts of the code should be protected by the lock. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


