Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
It does generalize types, but only on the intersection of the columns, it seems.
There might be a way to get the union of the columns too using HiveQL. Types
generalize upward, with string being the "most general".

Matei
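A hedged sketch of the HiveQL route mentioned above, assuming a HiveContext named hc and two hypothetical tables whose columns only partially overlap (the column names a, b, c are made up); missing columns are padded with typed NULLs so the result carries the union of the columns:

  // Sketch only: register the two Parquet files as temp tables, then pad the
  // missing columns with typed NULLs so UNION ALL yields the union of the columns.
  // Assumes tableA has columns (a, b) and tableB has columns (a, c).
  hc.parquetFile("fileA").registerTempTable("tableA")
  hc.parquetFile("fileB").registerTempTable("tableB")

  val combined = hc.sql("""
    SELECT a, b, CAST(NULL AS STRING) AS c FROM tableA
    UNION ALL
    SELECT a, CAST(NULL AS STRING) AS b, c FROM tableB
  """)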

> On Nov 1, 2014, at 6:22 PM, Daniel Mahler  wrote:
> 
> Thanks Matei. What does unionAll do if the input RDD schemas are not 100%
> compatible? Does it take the union of the columns and generalize the types?
> 
> thanks
> Daniel
> 
> On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia wrote:
> Try unionAll, which is a special method on SchemaRDDs that keeps the schema 
> on the results.
> 
> Matei
> 
> > On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote:
> >
> > I would like to combine 2 parquet tables I have created.
> > I tried:
> >
> >   sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
> >
> > but that just returns RDD[Row].
> > How do I combine them to get a SchemaRDD[Row]?
> >
> > thanks
> > Daniel
> 
> 



Re: union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
Thanks Matei. What does unionAll do if the input RDD schemas are not 100%
compatible? Does it take the union of the columns and generalize the types?

thanks
Daniel

On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia 
wrote:

> Try unionAll, which is a special method on SchemaRDDs that keeps the
> schema on the results.
>
> Matei
>
> > On Nov 1, 2014, at 3:57 PM, Daniel Mahler  wrote:
> >
> > I would like to combine 2 parquet tables I have created.
> > I tried:
> >
> >   sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
> >
> > but that just returns RDD[Row].
> > How do I combine them to get a SchemaRDD[Row]?
> >
> > thanks
> > Daniel
>
>


Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on 
the results.

Matei
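A minimal sketch of this suggestion for the Parquet tables in the question quoted below, assuming sqx is a SQLContext and the two files share a compatible schema:

  // unionAll is defined on SchemaRDD, so the result keeps its schema,
  // unlike sc.union, which only returns a plain RDD[Row].
  val a = sqx.parquetFile("fileA")
  val b = sqx.parquetFile("fileB")
  val combined = a.unionAll(b)   // still a SchemaRDD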

> On Nov 1, 2014, at 3:57 PM, Daniel Mahler  wrote:
> 
> I would like to combine 2 parquet tables I have created.
> I tried:
> 
>   sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))
> 
> but that just returns RDD[Row].
> How do I combine them to get a SchemaRDD[Row]?
> 
> thanks
> Daniel





union of SchemaRDDs

2014-11-01 Thread Daniel Mahler
I would like to combine 2 parquet tables I have created.
I tried:

  sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB"))

but that just returns RDD[Row].
How do I combine them to get a SchemaRDD[Row]?

thanks
Daniel
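As an aside (not from the thread), a hedged alternative sketch: if both tables share the same schema, the plain RDD[Row] that sc.union returns can be turned back into a SchemaRDD by re-applying one of the original schemas via SQLContext.applySchema, though the unionAll suggested above is simpler:

  // Sketch only: assumes sqx is a SQLContext and both files have the same schema.
  val a = sqx.parquetFile("fileA")
  val b = sqx.parquetFile("fileB")
  // sc.union drops the SchemaRDD type; applySchema re-attaches the schema.
  val combined = sqx.applySchema(sc.union(a, b), a.schema)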


Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-31 Thread Michael Armbrust
>
> * unionAll preserves duplicates vs. union, which does not
>

This is true. If you want to eliminate duplicate items, you should follow
the union with a distinct().
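A minimal sketch of that combination, with a and b as hypothetical SchemaRDDs sharing the same schema:

  val withDuplicates = a.unionAll(b)            // keeps duplicates, like SQL UNION ALL
  val deduplicated   = a.unionAll(b).distinct() // drops duplicates, like SQL UNION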


> * SQL union and unionAll result in the same output format (i.e. another SQL
> result set), vs. different RDD types here.
>
> * I understand the existing union contract issue. This may be a class-hierarchy
> discussion for SchemaRDD, UnionRDD, etc.?
>

This is unfortunately going to be a limitation of the query DSL, since it
extends standard RDDs. It is not possible for us to return specialized
types from functions that are already defined in RDD (such as union), as the
base RDD class has a very opaque notion of schema, and at this point the
API for RDDs is very fixed. If you use SQL, however, you will always get
back SchemaRDDs.
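A hedged illustration of that last point, assuming the inputs have already been registered as temp tables under the made-up names t1 and t2:

  // Going through the SQL parser always yields a SchemaRDD, whichever
  // operators the query uses.
  val unioned = sqlContext.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2")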


Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi Aaron,

unionAll is a workaround ...

* unionAll preserves duplicates vs. union, which does not
* SQL union and unionAll result in the same output format (i.e. another SQL
result set), vs. different RDD types here.
* I understand the existing union contract issue. This may be a class-hierarchy
discussion for SchemaRDD, UnionRDD, etc.?

Thanks,




On Sun, Mar 30, 2014 at 11:08 AM, Aaron Davidson  wrote:

> Looks like there is a "unionAll" function on SchemaRDD which will do what
> you want. The contract of RDD#union is unfortunately too general to allow
> it to return a SchemaRDD without downcasting.
>
>
> On Sun, Mar 30, 2014 at 7:56 AM, Manoj Samel wrote:
>
>> Hi,
>>
>> I am trying Spark SQL based on the example in the docs ...
>>
>> 
>>
>> val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt")
>>   .map(_.split(","))
>>   .map(p => Person(p(0), p(1).trim.toInt))
>>
>>
>> val olderThanTeans = people.where('age > 19)
>> val youngerThanTeans = people.where('age < 13)
>> val nonTeans = youngerThanTeans.union(olderThanTeans)
>>
>> I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the
>> third. The nonTeans is a UnionRDD that does not support orderBy. This
>> seems different from the SQL behavior, where the union of 2 SQL queries is
>> itself SQL with the same functionality ...
>>
>> Not clear why the union of 2 SchemaRDDs does not produce a SchemaRDD ...
>>
>>
>> Thanks,
>>
>>
>>
>


Re: Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Aaron Davidson
Looks like there is a "unionAll" function on SchemaRDD which will do what
you want. The contract of RDD#union is unfortunately too general to allow
it to return a SchemaRDD without downcasting.
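A minimal sketch of the fix for the example quoted below, reusing its SchemaRDDs:

  // unionAll is defined on SchemaRDD, so the combined result still supports
  // the DSL, including orderBy, unlike the UnionRDD returned by RDD#union.
  val nonTeans = youngerThanTeans.unionAll(olderThanTeans)
  val sorted   = nonTeans.orderBy('age)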


On Sun, Mar 30, 2014 at 7:56 AM, Manoj Samel wrote:

> Hi,
>
> I am trying Spark SQL based on the example in the docs ...
>
> 
>
> val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt")
>   .map(_.split(","))
>   .map(p => Person(p(0), p(1).trim.toInt))
>
>
> val olderThanTeans = people.where('age > 19)
> val youngerThanTeans = people.where('age < 13)
> val nonTeans = youngerThanTeans.union(olderThanTeans)
>
> I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the
> third. The nonTeans is a UnionRDD that does not support orderBy. This
> seems different from the SQL behavior, where the union of 2 SQL queries is
> itself SQL with the same functionality ...
>
> Not clear why the union of 2 SchemaRDDs does not produce a SchemaRDD ...
>
>
> Thanks,
>
>
>


Shouldn't the UNION of SchemaRDDs produce SchemaRDD ?

2014-03-30 Thread Manoj Samel
Hi,

I am trying Spark SQL based on the example in the docs ...



val people = sc.textFile("/data/spark/examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))


val olderThanTeans = people.where('age > 19)
val youngerThanTeans = people.where('age < 13)
val nonTeans = youngerThanTeans.union(olderThanTeans)

I can do an orderBy('age) on the first two (which are SchemaRDDs) but not on the
third. The nonTeans is a UnionRDD that does not support orderBy. This
seems different from the SQL behavior, where the union of 2 SQL queries is
itself SQL with the same functionality ...

Not clear why the union of 2 SchemaRDDs does not produce a SchemaRDD ...


Thanks,