Delete checkpointed data for a single dataset?

2019-10-23 Thread Isabelle Phan
Hello, In a non-streaming application, I am using the checkpoint feature to truncate the lineage of complex datasets. At the end of the job, the checkpointed data, which is stored in HDFS, is deleted. I am looking for a way to delete unused checkpointed data earlier than the end of the job. If
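A minimal sketch of one workaround, assuming the RDD-level checkpoint API (Dataset.checkpoint in Spark 2.x ultimately checkpoints an RDD underneath): RDD.getCheckpointFile exposes the HDFS directory the data landed in, which can be deleted by hand once nothing downstream will recompute from it. The checkpoint directory path below is hypothetical.

    import org.apache.hadoop.fs.{FileSystem, Path}

    spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoints") // hypothetical location

    val rdd = spark.range(1000000L).rdd
    rdd.checkpoint()
    rdd.count() // an action materializes the checkpoint on HDFS

    // Delete this one RDD's checkpoint data before the job ends.
    // Only safe once no later stage will recompute from this lineage.
    rdd.getCheckpointFile.foreach { dir =>
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.delete(new Path(dir), true)
    }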

Read ORC file with subset of schema

2019-08-30 Thread Isabelle Phan
Hello, When reading an older ORC file whose schema is a subset of the current schema, the reader throws an error. Please see sample code below (run on Spark 2.1). The same commands on a Parquet file do not error out; they return the new column with null values. Is there a setting to add to the rea
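One approach worth trying (a sketch, not from the thread): pass the current, full schema explicitly to the reader so the missing column is at least declared. Whether the ORC reader then fills it with nulls the way Parquet does depends on the Spark/ORC version; the file path and column names below are hypothetical.

    import org.apache.spark.sql.types._

    // Current schema; older files on disk were written without col_c.
    val fullSchema = StructType(Seq(
      StructField("col_a", StringType),
      StructField("col_b", IntegerType),
      StructField("col_c", StringType) // added after the old files were written
    ))

    val df = spark.read.schema(fullSchema).orc("/data/old_orc_table")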

ClassCastException when using SparkSQL Window function

2016-11-17 Thread Isabelle Phan
Hello, I have a simple session table, which tracks the pages users visited with a sessionId. I would like to apply a window function by sessionId, but I am hitting a type cast exception. I am using Spark 1.5.0. Here is sample code: scala> df.printSchema root |-- sessionid: string (nullable = true) |-
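For reference, a self-contained window-function sketch over such a session table (assumed columns sessionid, page, ts). Note that on Spark 1.5 window functions required a HiveContext and the ranking function was named rowNumber(); the sketch below uses the later API.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Hypothetical session data.
    val df = Seq(
      ("s1", "/home", 1L), ("s1", "/search", 2L), ("s2", "/home", 5L)
    ).toDF("sessionid", "page", "ts")

    val w = Window.partitionBy("sessionid").orderBy("ts")
    df.withColumn("visit_order", row_number().over(w)).show()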

Re: DataFrame creation delay?

2015-12-11 Thread Isabelle Phan
rom temp.log"? Adding a where clause to the query with a partition condition will help Spark prune the request to just the required partitions (vs. all, which is proving expensive). On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan wrote: Hi Michael,
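A sketch of the suggestion, assuming dt is the table's partition column (the table name comes from the thread; the date literal is made up):

    // Full scan: partition metadata for every partition must be resolved.
    val all = sqlContext.sql("SELECT * FROM temp.log")

    // Predicate on the partition column: Spark prunes to one partition.
    val oneDay = sqlContext.sql("SELECT * FROM temp.log WHERE dt = '2015-12-11'")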

Re: DataFrame creation delay?

2015-12-10 Thread Isabelle Phan
lot for your help, Isabelle On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust wrote: If you run sqlContext.table("...").registerTempTable("...") that temp table will cache the lookup of partitions. On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan wrote:
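Spelled out, the suggestion looks like this (table and temp-table names hypothetical): the metastore lookup happens once, and later queries hit the registered temp table instead.

    sqlContext.table("temp.log").registerTempTable("log_cached")
    val df = sqlContext.sql("SELECT * FROM log_cached WHERE dt = '2015-12-11'")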

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
ion "/test/table/" on HDFS and has partitions: "/test/table/dt=2012" "/test/table/dt=2013" df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")

SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Isabelle Phan
Hello, Is there any API to insert data into a single partition of a table? Let's say I have a table with 2 columns (col_a, col_b), partitioned by date. After doing some computation for a specific date, I have a DataFrame with 2 columns (col_a, col_b) which I would like to insert into a specifi
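One way to do this in the Spark 1.5/1.6 era (a sketch; the target table name and date are hypothetical) is to register the DataFrame and issue a static-partition INSERT through HiveQL, which requires a HiveContext:

    df.registerTempTable("staging")
    sqlContext.sql(
      "INSERT OVERWRITE TABLE target_table PARTITION (date = '2015-12-01') " +
      "SELECT col_a, col_b FROM staging")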

How to catch error during Spark job?

2015-10-27 Thread Isabelle Phan
Hello, I had a question about error handling in a Spark job: if an exception occurs during the job, what is the best way to get notified of the failure? Can Spark jobs return different exit codes? For example, I wrote a dummy Spark job that just throws an Exception, as follows: import org
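A minimal sketch of the pattern: catch the failure in the driver's main method and return a non-zero exit code. How faithfully that code reaches the caller of spark-submit depends on deploy mode (client mode propagates it directly; cluster mode reports it through the application's final status).

    object DummyJob {
      def main(args: Array[String]): Unit = {
        try {
          // ... build the SparkContext and run the job ...
          throw new RuntimeException("boom") // stand-in for a real failure
        } catch {
          case e: Exception =>
            System.err.println(s"Job failed: ${e.getMessage}")
            sys.exit(1) // non-zero exit code signals failure
        }
      }
    }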

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
specific column in a way that breaks when we do the rewrite of one side of the query. Using the apply method constructs a resolved column eagerly (which loses the alias information). On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan wrote: Thanks Michael and Ali

Re: How to distinguish columns when joining DataFrames with shared parent?

2015-10-21 Thread Isabelle Phan
Thanks Michael and Ali for the reply! I'll make sure to use unresolved columns when working with self-joins then. As pointed out by Ali, isn't there still an issue with the aliasing? It works when using the org.apache.spark.sql.functions.col(colName: String) method, but not when using org.apache.spark.sq
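The distinction in a sketch (column and table names hypothetical): functions.col builds an unresolved column that is resolved against the joined plan, so an alias set with .as survives; DataFrame.apply resolves eagerly against the original plan.

    import org.apache.spark.sql.functions.col

    val joined = df.as("l").join(df.as("r"), col("l.key") === col("r.key"))

    // Unresolved: "l.value" is resolved against the join, alias intact.
    joined.select(col("l.value")).show()

    // Eagerly resolved against df itself, so the "l"/"r" alias is lost:
    // joined.select(df("value"))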

How to distinguish columns when joining DataFrames with shared parent?

2015-10-20 Thread Isabelle Phan
Hello, When joining 2 DataFrames which originate from the same initial DataFrame, why can't the org.apache.spark.sql.DataFrame.apply(colName: String) method distinguish which column to read? Let me illustrate this question with a simple example (run on Spark 1.5.1): //my initial DataFrame scala> df
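A small reproduction sketch of the ambiguity (made-up data): both join inputs descend from the same parent, so df("v") carries the same underlying attribute on both sides.

    import sqlContext.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "v")
    val left  = df.filter($"v" === "a")
    val right = df.filter($"v" === "b")

    val joined = left.join(right, left("id") === right("id"))
    // joined.select(df("v")) // ambiguous: which side's "v" is meant?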

Re: DataFrame creation delay?

2015-09-04 Thread Isabelle Phan
PM, Michael Armbrust wrote: What format is this table? For Parquet and other optimized formats we cache a bunch of file metadata on first access to make interactive queries faster. On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote:

Re: How to determine the value for spark.sql.shuffle.partitions?

2015-09-03 Thread Isabelle Phan
+1 I had the exact same question as I am working on my first Spark applications. Hope someone can share some best practices. Thanks! Isabelle On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman wrote: Hi all, The number of partitions greatly affects the speed and efficiency of calculation, in
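For what it's worth, the knob itself is just a SQLConf setting (default 200); a common rule of thumb, not an official guideline, is to size it so each shuffle task handles on the order of 100-200 MB:

    // Every SparkSQL shuffle (join, groupBy, window) produces this many partitions.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")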

DataFrame creation delay?

2015-09-03 Thread Isabelle Phan
Hello, I am using SparkSQL to query some Hive tables. Most of the time, when I create a DataFrame using the sqlContext.sql("select * from table") command, DataFrame creation takes less than 0.5 seconds. But for this one table it takes almost 12 seconds! scala> val start = scala.compat.Plat
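The truncated timing snippet presumably continues with scala.compat.Platform.currentTime; a completed sketch of the measurement:

    val start = scala.compat.Platform.currentTime
    val df = sqlContext.sql("select * from table")
    val elapsedMs = scala.compat.Platform.currentTime - start
    println(s"DataFrame creation took $elapsedMs ms")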

DataFrame rollup with alias?

2015-08-23 Thread Isabelle Phan
Hello, I am new to Spark and just running some tests to get familiar with the APIs. When calling the rollup function on my DataFrame, I get different results when I alias the columns I am grouping on (see below for an example data set). I was expecting the alias function to only affect the column name.
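A sketch of the comparison being described (made-up data; the exact behavior difference is version-dependent):

    import org.apache.spark.sql.functions._
    import sqlContext.implicits._

    val df = Seq(("US", "a", 1), ("US", "b", 2), ("JP", "a", 3))
      .toDF("country", "tag", "cnt")

    // Rollup on the raw columns:
    df.rollup($"country", $"tag").agg(sum($"cnt")).show()

    // Rollup on aliased columns; expected to change only the column names:
    df.rollup($"country".as("c"), $"tag".as("t")).agg(sum($"cnt")).show()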