Re: Spark 2.x Core: .setMaster(local[*]) output is different from spark-submit

2018-03-17 Thread klrmowse
For clarification: calling .saveAsTextFile() on the RDD writes to the local filesystem, but not to HDFS. Anyone have an idea why?
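
A minimal sketch of one likely explanation, assuming the difference comes from which fs.defaultFS the driver picks up in local[*] mode versus under spark-submit: spelling out the URI scheme makes the target filesystem explicit. The namenode address below is hypothetical.

    import org.apache.spark.{SparkConf, SparkContext}

    // With an explicit scheme the output location no longer depends on the
    // cluster configuration the driver happens to load.
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("save-demo"))
    val rdd = sc.parallelize(Seq("a", "b", "c"))

    rdd.saveAsTextFile("file:///tmp/save-demo-output")                // local filesystem
    // rdd.saveAsTextFile("hdfs://namenode:8020/user/me/save-demo")   // hypothetical HDFS URI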

Re: Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
Hi Denis, great to see you here :) It works, thanks! Do you know how Spark generates the data file names? The names look like part-<number> with a UUID appended, e.g. part-0-124a8c43-83b9-44e1-a9c4-dcc8676cdb99.c000.snappy.parquet
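
Not an authoritative answer, but the example name above suggests a pattern of roughly part-<task partition id>-<job UUID>.c<file counter>.<codec>.<format>. The snippet below only splits such a name into those pieces for illustration; the pattern is inferred from the example, not a documented contract.

    // Hypothetical decomposition of a Spark output file name.
    val name = "part-0-124a8c43-83b9-44e1-a9c4-dcc8676cdb99.c000.snappy.parquet"
    val Pattern = """part-(\d+)-([0-9a-f-]+)\.c(\d+)\.(\w+)\.(\w+)""".r
    name match {
      case Pattern(part, uuid, counter, codec, format) =>
        println(s"partition=$part uuid=$uuid counter=$counter codec=$codec format=$format")
      case _ =>
        println("name does not match the assumed pattern")
    }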

Re: Append more files to existing partitioned data

2018-03-17 Thread Denis Bolshakov
Hello Serega, see https://spark.apache.org/docs/latest/sql-programming-guide.html. Please try the SaveMode.Append option. Does it work for you?
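
A minimal sketch of that suggestion applied to the write from the original question below; the input path is hypothetical.

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder().getOrCreate()
    val ds = spark.read.parquet("/here/is/my/input")   // hypothetical input

    // SaveMode.Append adds new files/partitions under the existing directory
    // instead of failing because the path already exists.
    ds.write
      .mode(SaveMode.Append)
      .partitionBy("year", "month", "day", "hour", "workflowId")
      .parquet("/here/is/my/dir")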

Append more files to existing partitioned data

2018-03-17 Thread Serega Sheypak
Hi, I'm using spark-sql to process my data and store the result as parquet, partitioned by several columns:

    ds.write
      .partitionBy("year", "month", "day", "hour", "workflowId")
      .parquet("/here/is/my/dir")

I want to run more jobs that will produce new partitions or add more files to existing partitions.

Re: how "hour" function in Spark SQL is supposed to work?

2018-03-17 Thread Serega Sheypak
> Not sure why you are dividing by 1000. from_unixtime expects a long type
> which is time in milliseconds.

It expects seconds; I have milliseconds.
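
A small runnable sketch of the point being made (the millisecond values are made up): from_unixtime expects seconds since the epoch, so millisecond timestamps are divided by 1000 first.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().master("local[*]").appName("hour-demo").getOrCreate()
    import spark.implicits._

    // Hypothetical epoch-millisecond values.
    val df = Seq(1521298800123L, 1521302400456L).toDF("epoch_ms")

    // Divide by 1000 (and cast back to long) so from_unixtime sees seconds,
    // then extract the hour from the resulting timestamp.
    df.select(hour(from_unixtime((col("epoch_ms") / 1000).cast("long"))).as("hour")).show()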

Dataframe size using RDDStorageInfo objects

2018-03-17 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of persisted dataframes using sparkContext.getRDDStorageInfo(). The RDDStorageInfo object has information about the number of bytes stored in memory and on disk. For example, I have 3 dataframes which I have cached: df1.cache(), df2.cache(), df3.cache()
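
A hedged sketch of what reading that storage info could look like (it is a developer API; the dataframe here is toy data, and an action is needed so the cache is actually populated):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("storage-info").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3).toDF("a").cache()
    df1.count()   // materialize the cache

    // Each entry describes one persisted RDD, including bytes held in memory
    // and on disk. Mapping an entry back to a specific dataframe still has to
    // be done by inspecting its name/id, which is the tricky part of the question.
    spark.sparkContext.getRDDStorageInfo.foreach { info =>
      println(s"id=${info.id} name=${info.name} memSize=${info.memSize} diskSize=${info.diskSize}")
    }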