[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16014944#comment-16014944
]
Bryan Cutler commented on SPARK-14141:
--
Take a look at SPARK-13534 which will make a Pandas
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15830750#comment-15830750
]
Luke Miner commented on SPARK-14141:
One option is to convert all the categorical variables into
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15579949#comment-15579949
]
holdenk commented on SPARK-14141:
-
Ah sorry for the delay, so doing the cache + count together is done
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15224934#comment-15224934
]
Luke Miner commented on SPARK-14141:
Do you think you could sketch out your method? I'd love to try
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218991#comment-15218991
]
holdenk commented on SPARK-14141:
-
If the data fits in memory on the cluster, cache + count +
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218930#comment-15218930
]
Luke Miner commented on SPARK-14141:
Anecdotally, at least, it seems like a pretty common workflow
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218909#comment-15218909
]
Davies Liu commented on SPARK-14141:
toLocalIterator is better than collect, but will run partitions
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214718#comment-15214718
]
Luke Miner commented on SPARK-14141:
Good to know. Would rdd.toLocalIterator() be a scalable way to
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213876#comment-15213876
]
Davies Liu commented on SPARK-14141:
toPandas() is just a convenient way to convert a small
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213206#comment-15213206
]
holdenk commented on SPARK-14141:
-
It's doable, but I'm not sure it belongs in Spark itself. Maybe
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15213199#comment-15213199
]
Luke Miner commented on SPARK-14141:
If that's the case, it sounds like it is doable. One way would
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212590#comment-15212590
]
holdenk commented on SPARK-14141:
-
So with RDDs there is `toLocalIterator` which you could use to do this
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212587#comment-15212587
]
holdenk commented on SPARK-14141:
-
The more I look at this the more I think it's not a good fit for Spark.
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212563#comment-15212563
]
Luke Miner commented on SPARK-14141:
Is there any way to do this process in chunks: read a chunk of
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212548#comment-15212548
]
holdenk commented on SPARK-14141:
-
So following up, `from_records` doesn't take dtypes although we could
[
https://issues.apache.org/jira/browse/SPARK-14141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212410#comment-15212410
]
holdenk commented on SPARK-14141:
-
I can take a crack at this, seems pretty reasonable & small.
> Let