> Here, the usage of *cacheTable* will affect ONLY the *sqlContext.sql* query.
>
> sqlContext.cacheTable("myData")
>
> sqlContext.sql("SELECT col1, col2 FROM myData").show()
>
>
> Thanks,
> Kevin
>
> On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos wrote:
According to the documentation they are exactly the same, but in my queries
dataFrame.cache()
results in much faster execution times than
sqlContext.cacheTable("tableName")
Is there any explanation for this? I am not caching the RDD prior to
creating the DataFrame. Using PySpark on Spark.
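For reference, a minimal sketch of the two caching paths being compared (Spark 1.x-era API; the input path, table name, and column names are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="cache-vs-cachetable")
sqlContext = SQLContext(sc)

# Placeholder input; any DataFrame source works the same way.
df = sqlContext.read.json("my_data.json")
df.registerTempTable("myData")

# Path 1: cache via the DataFrame handle.
df.cache()
df.count()  # an action materializes the cache

# Path 2: cache via the catalog; picked up by sqlContext.sql queries.
sqlContext.cacheTable("myData")
sqlContext.sql("SELECT col1, col2 FROM myData").show()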
Hello,
In a 2-worker cluster (6 cores/30 GB RAM and 24 cores/60 GB RAM),
how can I tell my executors to use all 90 GB of available memory?
In the configuration you can set e.g. "spark.cores.max" to 30 (24+6),
but you cannot set "spark.executor.memory" to 90g (30+60).
Kind regards,
George
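One thing worth noting: spark.executor.memory is a per-executor setting, not a cluster-wide total, so it has to fit within a single worker. A minimal sketch of how these settings are usually passed (the 25g value is only illustrative, sized to the smaller 30 GB worker):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("memory-config")
        # Total cores across the whole cluster (24 + 6).
        .set("spark.cores.max", "30")
        # Per-executor memory: must fit on the smallest worker,
        # so 90g (the cluster total) is not a valid value here.
        .set("spark.executor.memory", "25g"))
sc = SparkContext(conf=conf)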
Hello,
Does anybody know how to copy a Cassandra table (or an entire keyspace)
from one cluster to another using Spark? I haven't found anything very
specific about this so far.
Thank you,
George
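A minimal sketch of one way this is commonly done with the DataStax spark-cassandra-connector; the keyspace, table, and host names are placeholders, and passing the connection host as a per-DataFrame option is an assumption to verify against the connector docs for your version:

# Assumes the spark-cassandra-connector package is on the classpath.
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")
      # Assumption: host override per DataFrame read/write.
      .option("spark.cassandra.connection.host", "source-cluster-host")
      .load())

(df.write
 .format("org.apache.spark.sql.cassandra")
 .options(keyspace="my_keyspace", table="my_table")
 .option("spark.cassandra.connection.host", "dest-cluster-host")
 .mode("append")
 .save())

For an entire keyspace you would loop this over the table names.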
Found the problem: Control-M characters. Please ignore the post.
On Wed, Nov 25, 2015 at 6:06 PM, George Sigletos
wrote:
> Hello,
>
> I have a text file consisting of 483150 lines (wc -l "my_file.txt").
>
> However when I read it using textFile:
>
> %pyspark
> rdd = sc.textFile("my_file.txt")
> print rdd.count()
Hello,
I have a text file consisting of 483150 lines (wc -l "my_file.txt").
However when I read it using textFile:
%pyspark
rdd = sc.textFile("my_file.txt")
print rdd.count()
it returns 554420 lines. Any idea why this is happening? Is it using a
different newline delimiter, and how can this be fixed?
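For anyone hitting the same symptom: Hadoop's TextInputFormat treats \n, \r, and \r\n all as line breaks, so stray Control-M (\r) bytes inside lines inflate the count. A sketch of pinning the record delimiter explicitly (the Hadoop config key is real; the file name is the thread's own, the rest is illustrative):

# Split records on \n only, so lone \r bytes stay inside their lines.
conf = {"textinputformat.record.delimiter": "\n"}
rdd = (sc.newAPIHadoopFile(
           "my_file.txt",
           "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
           "org.apache.hadoop.io.LongWritable",
           "org.apache.hadoop.io.Text",
           conf=conf)
         .map(lambda kv: kv[1]))  # keep only the line text
print(rdd.count())  # should now agree with wc -l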