PySpark 2: Kmeans The input data is not directly cached

Zakaria Hili Thu, 03 Nov 2016 09:17:02 -0700

Hi,

I don't know why I receive this message:


 WARN KMeans: The input data is not directly cached, which may hurt
performance if its parent RDDs are also uncached.

when I try to use Spark KMeans:

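from pyspark.ml.clustering import KMeans

df_Part = assembler.transform(df_Part)
df_Part.cache()
# Increase k until the cost drops below the threshold or k exceeds the maximum
while (k <= max_cluster) and (wssse > seuilStop):
    kmeans = KMeans().setK(k)
    model = kmeans.fit(df_Part)
    wssse = model.computeCost(df_Part)
    k = k + 1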

It says that my input (DataFrame) is not cached!

I tried printing df_Part.is_cached and received True, which means that my
DataFrame is cached. So why is Spark still warning me about this?

Thank you in advance.

