Hello,
In a non-streaming application, I am using the checkpoint feature to
truncate the lineage of complex datasets. At the end of the job, the
checkpointed data, which is stored in HDFS, is deleted.
I am looking for a way to delete the unused checkpointed data earlier than
the end of the job. If
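(For context, a minimal sketch of the pattern, assuming sc is the SparkContext and complexRdd is the dataset being checkpointed; deleting the files through the Hadoop FileSystem API is one possible approach, not an official cleanup API:)
import org.apache.hadoop.fs.{FileSystem, Path}
sc.setCheckpointDir("hdfs:///tmp/checkpoints")
complexRdd.checkpoint()
complexRdd.count()  // an action materializes the checkpoint
// Later, once no downstream computation still reads the checkpoint files,
// they could be deleted manually:
complexRdd.getCheckpointFile.foreach { dir =>
  val fs = FileSystem.get(sc.hadoopConfiguration)
  fs.delete(new Path(dir), true)
}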
Hello,
When reading an older ORC file whose schema is a subset of the current
schema, the reader throws an error. Please see the sample code below (run on
Spark 2.1).
The same commands on a parquet file do not error out, they return the new
column with null values.
Is there a setting to add to the rea
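(One workaround to sketch, with hypothetical path and column names: read with the full current schema supplied explicitly, so Spark does not infer the old one. Whether this avoids the error on Spark 2.1's Hive-based ORC reader is not guaranteed.)
import org.apache.spark.sql.types._
val currentSchema = StructType(Seq(
  StructField("col_a", StringType, nullable = true),
  StructField("col_b", IntegerType, nullable = true)  // absent from older files
))
val df = spark.read.schema(currentSchema).orc("/path/to/old_orc")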
Hello,
I have a simple session table, which tracks pages users visited with a
sessionId. I would like to apply a window function by sessionId, but am
hitting a type cast exception. I am using Spark 1.5.0.
Here is sample code:
scala> df.printSchema
root
|-- sessionid: string (nullable = true)
|-
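(A minimal window-function sketch for reference, assuming a hypothetical ordering column "ts"; in Spark 1.5 the function is rowNumber(), renamed row_number() in later releases:)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
val w = Window.partitionBy("sessionid").orderBy("ts")
val ranked = df.withColumn("rank", rowNumber().over(w))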
rom temp.log"? Adding a
> where clause to the query with a partition condition will help Spark prune
> the request to just the required partitions (vs. all, which is proving
> expensive).
>
> On Fri, Dec 11, 2015 at 3:59 AM Isabelle Phan wrote:
>
>> Hi Michael,
>
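(Illustrating the pruning suggestion above, assuming a hypothetical partition column dt:)
sqlContext.sql("SELECT * FROM temp.log WHERE dt = '2015-12-10'")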
Thanks a lot for your help,
Isabelle
On Fri, Sep 4, 2015 at 1:42 PM, Michael Armbrust
wrote:
> If you run sqlContext.table("...").registerTempTable("...") that
> temp table will cache the lookup of partitions.
>
> On Fri, Sep 4, 2015 at 1:16 PM, Isabelle Phan wrote:
>
>
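(A sketch of that suggestion, with hypothetical table names; registering the temp table once caches the partition lookup for later queries:)
sqlContext.table("temp.log").registerTempTable("log_cached")
val df = sqlContext.sql("SELECT * FROM log_cached WHERE dt = '2015'")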
ion "/test/table/" on HDFS
>>> and has partitions:
>>>
>>> "/test/table/dt=2012"
>>> "/test/table/dt=2013"
>>>
>>> df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table")
>>>
>>>
>>>
>>>
Hello,
Is there any API to insert data into a single partition of a table?
Let's say I have a table with 2 columns (col_a, col_b), partitioned by
date.
After doing some computation for a specific date, I have a DataFrame with 2
columns (col_a, col_b) which I would like to insert into a specifi
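(One possible approach to sketch, with hypothetical table and column names: add the partition value as a literal column and append via insertInto, which matches columns by position, so the partition column goes last. This may also require Hive's dynamic partition settings.)
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit
result.withColumn("date", lit("2015-12-10"))
  .write
  .mode(SaveMode.Append)
  .insertInto("my_db.my_table")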
Hello,
I had a question about error handling in a Spark job: if an exception occurs
during the job, what is the best way to get notified of the failure?
Can Spark jobs exit with different exit codes?
For example, I wrote a dummy Spark job that just throws an Exception, as
follows:
import org
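(A minimal sketch of such a dummy job, with hypothetical names; catching the exception and calling sys.exit with a non-zero code is one way to signal failure to an external scheduler:)
import org.apache.spark.{SparkConf, SparkContext}
object DummyFailingJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dummy"))
    try {
      throw new RuntimeException("simulated failure")
    } catch {
      case e: Exception =>
        sc.stop()
        sys.exit(1)  // non-zero exit code signals failure to the caller
    }
  }
}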
> specific column in a way that breaks when we do the rewrite of one side of
> the query. Using the apply method constructs a resolved column eagerly
> (which loses the alias information).
>
> On Wed, Oct 21, 2015 at 1:49 PM, Isabelle Phan wrote:
>
>> Thanks Michael and Ali
Thanks Michael and Ali for the reply!
I'll make sure to use unresolved columns when working with self-joins then.
As pointed out by Ali, isn't there still an issue with the aliasing? It works
when using the org.apache.spark.sql.functions.col(colName: String) method, but
not when using org.apache.spark.sq
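(A sketch of the unresolved-column pattern for self-joins, assuming a hypothetical "id" column:)
import org.apache.spark.sql.functions.col
val left = df.as("l")
val right = df.as("r")
val joined = left.join(right, col("l.id") === col("r.id"))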
Hello,
When joining 2 DataFrames which originate from the same initial DataFrame,
why can't org.apache.spark.sql.DataFrame.apply(colName: String) method
distinguish which column to read?
Let me illustrate this question with a simple example (run on Spark 1.5.1):
//my initial DataFrame
scala> df
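(A sketch of the ambiguity, with hypothetical "id" and "value" columns: both sides of the join resolve to the same underlying attribute, so the condition cannot tell them apart:)
val filtered = df.filter(df("value") > 0)
df.join(filtered, df("id") === filtered("id"))  // trivially true / ambiguous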
PM, Michael Armbrust
> wrote:
>
>> What format is this table? For parquet and other optimized formats we
>> cache a bunch of file metadata on first access to make interactive queries
>> faster.
>>
>> On Thu, Sep 3, 2015 at 8:17 PM, Isabelle Phan wrote:
>>
+1
I had the exact same question, as I am working on my first Spark
application.
Hope someone can share some best practices.
Thanks!
Isabelle
On Tue, Sep 1, 2015 at 2:17 AM, Romi Kuntsman wrote:
> Hi all,
>
> The number of partitions greatly affects the speed and efficiency of the
> calculation, in
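(For reference, a minimal sketch of controlling the partition count explicitly, with hypothetical numbers:)
val shuffled = rdd.repartition(200)  // full shuffle into 200 partitions
val merged = rdd.coalesce(50)        // narrow merge, avoids a shuffle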
Hello,
I am using SparkSQL to query some Hive tables. Most of the time, when I
create a DataFrame using the sqlContext.sql("select * from table") command,
DataFrame creation takes less than 0.5 seconds.
But I have this one table with which it takes almost 12 seconds!
scala> val start = scala.compat.Plat
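(The snippet is cut off above; a hedged guess at the timing harness it was building toward:)
val start = scala.compat.Platform.currentTime
val df = sqlContext.sql("select * from table")
println(s"took ${scala.compat.Platform.currentTime - start} ms")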
Hello,
I am new to Spark and just running some tests to get familiar with the APIs.
When calling the rollup function on my DataFrame, I get different results
when I alias the columns I am grouping on (see below for example data set).
I was expecting the alias function to only affect the column name.
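(A sketch of the comparison described, with hypothetical columns "country", "city", and "sales":)
import sqlContext.implicits._
import org.apache.spark.sql.functions._
df.rollup($"country", $"city").agg(sum($"sales")).show()
// versus aliasing the grouping columns:
df.rollup($"country".as("c"), $"city".as("ct")).agg(sum($"sales")).show()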