This question is about using Spark with dplyr (via sparklyr). We load a lot of data from Oracle into data frames through a JDBC connection:
dfX <- spark_read_jdbc(
  spConn, "myconnection",
  options = list(
    url      = urlDEVdb,
    driver   = "oracle.jdbc.OracleDriver",
    user     = dbt_schema,
    password = dbt_password,
    dbtable  = pQuery,
    memory   = FALSE   # don't cache the whole (big) table
  )
)

Then we run a lot of SQL statements and use sdf_register to register the results. Eventually we want to write the final result to a database.

Although we have set memory = FALSE, we see all these tables get cached. I notice that counts are triggered (I think this happens just before a table is cached) and a collect is triggered. We also think that registering tables with sdf_register triggers a collect action (it almost looks like these are cached as well). This leads to a lot of actions (often on the data frames resulting from the same pipeline), which takes a long time.

Questions to people using dplyr + Spark:
1) Is it possible that this memory = FALSE is ignored when reading through JDBC?
2) Can someone confirm that there is a lot of automatic caching happening (and hence a lot of counts and a lot of actions)?

Thanks for input!
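For concreteness, the downstream steps look roughly like the sketch below. The queries, column names, and target table are placeholders, and sdf_sql / spark_write_jdbc here stand in for however the SQL statements and the final write are actually issued in our pipeline:

library(sparklyr)

# dfX from spark_read_jdbc() above is registered as "myconnection";
# we run SQL against it and register each intermediate result.
step1 <- sdf_sql(spConn, "SELECT * FROM myconnection WHERE some_col > 0")   # hypothetical query
step1_tbl <- sdf_register(step1, "step1")

step2 <- sdf_sql(spConn, "SELECT some_col, count(*) AS n FROM step1 GROUP BY some_col")
result_tbl <- sdf_register(step2, "final_result")

# Final result written back to the database over JDBC.
spark_write_jdbc(
  result_tbl,
  name = "FINAL_RESULT",                 # hypothetical target table
  options = list(
    url      = urlDEVdb,
    driver   = "oracle.jdbc.OracleDriver",
    user     = dbt_schema,
    password = dbt_password,
    dbtable  = "FINAL_RESULT"
  ),
  mode = "overwrite"
)

Even in a stripped-down pipeline like this, we see the counts/collects described above, which is what prompted the two questions.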