I am trying to access a mid-size Teradata table (~100 million rows) via
JDBC in standalone mode on a single node (local[*]). When I tried the same
with a big table (5 billion records), no results were returned after the
query completed.
I am using Spark 1.4.1, set up on a very powerful machine (2 CPUs, 24
cores, 126 GB RAM).
I have tried several memory settings and tuning options to make it run
faster, but none of them made a noticeable impact.
I am sure I am missing something. Below is my latest attempt, which took
about 11 minutes to get this simple count, versus only 40 seconds to get
the same count over a JDBC connection through R.
bin/pyspark --driver-memory 40g --executor-memory 40g
df = sqlContext.read.jdbc("jdbc:teradata://......")
df.count()
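As far as I can tell, when `read.jdbc` is called without the
`column`/`lowerBound`/`upperBound`/`numPartitions` arguments, Spark issues a
single query and reads the whole table through one partition, so the count
cannot be parallelized. To illustrate what I mean, here is a small
pure-Python sketch (the column name `id` and the bounds are hypothetical,
not from my actual table) of roughly how Spark derives per-partition WHERE
clauses when those arguments are supplied:

```python
def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Simplified sketch of how Spark's JDBC source splits a numeric
    column range into per-partition WHERE clauses. The first partition
    is unbounded below and the last unbounded above, so rows outside
    [lower, upper) are still read."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        # No lower bound on the first partition.
        lbound = "%s >= %d" % (column, current) if i > 0 else None
        current += stride
        # No upper bound on the last partition.
        ubound = "%s < %d" % (column, current) if i < num_partitions - 1 else None
        if lbound and ubound:
            predicates.append("%s AND %s" % (lbound, ubound))
        else:
            predicates.append(lbound or ubound)
    return predicates

# Hypothetical 4-way split over id in [0, 100)
print(jdbc_partition_predicates("id", 0, 100, 4))
```

The corresponding pyspark call would be something like
`sqlContext.read.jdbc(url, table, column="id", lowerBound=0,
upperBound=100, numPartitions=4)`, with the bounds taken from the real
min/max of an indexed numeric column.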