I forgot to mention that the imageId field is a custom Scala object. Do I need to implement any special methods (equals, hashCode) to make it work?
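To make the question concrete, here is a minimal sketch; ImageId is a hypothetical stand-in for my actual key type:

    // Hypothetical stand-in for the real key type. A case class gets
    // structural equals/hashCode generated by the compiler for free.
    case class ImageId(dataset: String, index: Long)

    // If the key has to stay a plain class, the manual overrides would
    // look like this:
    class ImageKey(val dataset: String, val index: Long) {
      override def equals(other: Any): Boolean = other match {
        case that: ImageKey => dataset == that.dataset && index == that.index
        case _              => false
      }
      override def hashCode: Int = (dataset, index).##
    }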
On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Dear all,
>
> The latest version of Spark has a feature called automatic partition
> discovery and schema migration for Parquet. As far as I know, this gives
> the ability to split a DataFrame into several Parquet files and, by
> loading just the parent directory, to get back the global schema of the
> parent DataFrame.
>
> I'm trying to use this feature for the following problem, but I'm running
> into trouble. I want to perform a series of feature extractions on a set
> of images. At the first step, my DataFrame has just two columns: imageId
> and imageRawData. I then transform the imageRawData column with different
> image feature extractors. The results can be of different types: for
> example, one feature could be an mllib.Vector and another an Array[Byte].
> Each feature extractor stores its output as a Parquet file with two
> columns, imageId and featureType. At the end, I have the following files:
>
> - features/rawData.parquet
> - features/feature1.parquet
> - features/feature2.parquet
>
> When I load all the features with:
>
> sqlContext.load("features")
>
> it seems to work, and in this example I get a DataFrame with 4 columns:
> imageId, imageRawData, feature1, feature2. But when I try to read the
> values, for example with show, some columns contain null fields and I
> just can't figure out what's going wrong.
>
> Any ideas?
>
> Best,
>
> Jao
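P.S. For concreteness, here is a self-contained sketch of the write/load pattern described above (Spark 1.3-style API; the column values and the local master are just illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object FeatureStoreSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("feature-store-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Each extractor writes its own Parquet file keyed by imageId
        // (toy values standing in for the real extracted features).
        val feature1 = Seq((1L, 0.5), (2L, 0.7)).toDF("imageId", "feature1")
        feature1.saveAsParquetFile("features/feature1.parquet")

        val feature2 = Seq((1L, Array[Byte](1, 2)), (2L, Array[Byte](3, 4)))
          .toDF("imageId", "feature2")
        feature2.saveAsParquetFile("features/feature2.parquet")

        // Loading the parent directory is expected to merge the schemas of
        // all Parquet files below it into one global schema.
        val merged = sqlContext.load("features")
        merged.printSchema()
        merged.show()
      }
    }

printSchema shows the merged schema with all the feature columns here, but show is where the null fields appear for me.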