Thanks Michael, much appreciated!

> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem. There
> is likely an error buried somewhere.
For your above comments - I don't get any error, but just get NULL as the
return value. I have tried digging deeper into the logs etc. but couldn't
spot anything. Are there any other suggestions for spotting such buried
errors?

Thanks,
Dhaval

On Mon, Aug 24, 2015 at 6:38 PM, Michael Armbrust <mich...@databricks.com> wrote:

>> Much appreciated! I am not comparing with "select count(*)" for
>> performance, but it was one simple thing I tried to check the performance
>> :). I think it now makes sense since Spark tries to extract all records
>> before doing the count. I thought having an aggregate-function query
>> submitted over JDBC/Teradata would let Teradata do the heavy lifting.
>
> We currently only push down filters, since there is a lot of variability in
> what types of aggregations various databases support. You can manually
> push down whatever you want by replacing the table name with a subquery
> (i.e. "(SELECT ... FROM ...)").
>
>> - How come my second query for (5B) records didn't return anything
>> even after a long processing? If I understood correctly, Spark would try to
>> fit it in memory and if not then might use disk space, which I have
>> available?
>
> Nothing should be held in memory for a query like this (other than a
> single count per partition), so I don't think that is the problem. There
> is likely an error buried somewhere.
>
>> - Am I supposed to do any Spark-related tuning to make it work?
>>
>> My main need is to access data from these large table(s) on demand and
>> provide aggregated and calculated results much more quickly; for that I
>> was trying out Spark. As a next step I am thinking of exporting the data
>> to Parquet files and giving it a try. Do you have any suggestions for
>> dealing with the problem?
>
> Exporting to Parquet will likely be a faster option than trying to query
> through JDBC, since we have many more opportunities for parallelism here.
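To make Michael's manual-pushdown suggestion concrete, here is a minimal sketch of the pattern: instead of passing a bare table name to the JDBC source, pass a parenthesized subquery so the database (Teradata here) computes the aggregate itself and Spark only fetches the tiny result. The table and column names are made up for illustration, and the Spark call is shown only as a comment.

```python
# Sketch of the "replace the table name with a subquery" trick. Most JDBC
# databases require the derived table to be parenthesized and aliased,
# e.g. "(SELECT ... FROM ...) t", so a small helper does the wrapping.

def as_dbtable(query: str, alias: str = "t") -> str:
    """Wrap a SQL query so it can be used where a table name is expected."""
    return f"({query}) {alias}"

# The COUNT(*) now runs inside the database, not in Spark:
dbtable = as_dbtable("SELECT COUNT(*) AS cnt FROM big_table")

# Hypothetical use with Spark's JDBC reader (not executed here; URL and
# table are placeholders):
# df = (spark.read.format("jdbc")
#       .option("url", "jdbc:teradata://host/database")
#       .option("dbtable", dbtable)
#       .load())

print(dbtable)
```

With this, Spark sees a one-row "table" containing the count, so no large transfer over JDBC happens at all.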
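On Michael's point about parallelism: one way to get some of it even before exporting to Parquet is a partitioned JDBC read, where the key range is split into slices and each slice is fetched over its own connection. The sketch below is a simplified, hypothetical version of how such per-partition WHERE clauses can be generated (Spark's JDBC reader does something similar when given partitionColumn/lowerBound/upperBound/numPartitions); column names and bounds are invented, and it assumes at least two partitions.

```python
# Split an integer key range [lower, upper) into num_partitions slices and
# emit one WHERE clause per slice, so each slice can be read in parallel.
# Edge partitions are open-ended (and the first catches NULL keys) so that
# every row lands in exactly one slice.

def partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE clause per partition (assumes num_partitions >= 2)."""
    stride = (upper - lower) // num_partitions or 1
    preds = []
    bound = lower
    for i in range(num_partitions):
        if i == 0:
            preds.append(f"{column} < {bound + stride} OR {column} IS NULL")
        elif i == num_partitions - 1:
            preds.append(f"{column} >= {bound}")
        else:
            preds.append(f"{column} >= {bound} AND {column} < {bound + stride}")
        bound += stride
    return preds

for p in partition_predicates("id", 0, 100, 4):
    print(p)
```

Each predicate would be appended to the per-partition query sent to the database; once the data is exported to Parquet, Spark gets this kind of parallelism for free from the file splits.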