Re: Dataset and JavaRDD: how to eliminate the header.

Carlo . Allocca Wed, 03 Aug 2016 10:59:02 -0700

Thanks Mich.

Yes, I know both headers (categoryRankSchema and categorySchema), as shown 
below:


        this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

        this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath);

Can you use filter to get rid of the header from both CSV files before joining 
them?
Well, I can give it a try. But my case is a bit more complex than joining two 
datasets. It joins data from at least six datasets, which are processed 
separately (to clean them and extract the needed info), and only at the end do 
I join three datasets for which I know the headers.

I do believe that there should be another way to achieve my goal.

Any other suggestion would be much appreciated.

Many Thanks.
Best Regards,
carlo



On 3 Aug 2016, at 18:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Do you know the headers?

Can you use filter to get rid of the header from both CSV files before joining 
them?



Dr Mich Talebzadeh






On 3 August 2016 at 18:32, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
Hi Aseem,

Thank you very much for your help.

Please, allow me to be more specific for my case (to some extent I already do 
what you suggested):

Let us imagine that I have two CSV datasets, d1 and d2. I generate each 
Dataset<Row> as follows:

== Reading d1:

sparkSession=spark;

        options = new HashMap<>();  // assumes options is declared as Map<String, String>
        options.put("header", "true");
        options.put("delimiter", delimiter);
        options.put("nullValue", nullValue);
        DataFrameReader d1_DFR = spark.read().options(options);
        this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

== Reading d2

sparkSession=spark;

        options = new HashMap<>();  // assumes options is declared as Map<String, String>
        options.put("header", "true");
        options.put("delimiter", delimiter);
        options.put("nullValue", nullValue);
        DataFrameReader d2_DFR = spark.read().options(options);
        this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath);


So far, I have the header set to true.

Now, let us imagine that we need to join the two datasets:

Dataset<Row> dataset1_Join_dataset2 = dataset1.join(dataset2, "some condition");

All the steps below (Step 1, Step 2 and Step 3) start from 
dataset1_Join_dataset2. In particular, I realised that after the steps

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>
        JavaRDD<Row> dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)
        Row header = dataPointsWithHeader.first();

the header is not the row returned by first().

So my question still is:

Is there an efficient way to access the header and eliminate it?

Many Thanks in advance for your support.

Best Regards,
Carlo




On 3 Aug 2016, at 18:13, Aseem Bansal <asmbans...@gmail.com> wrote:

Hi

Depending on how you are reading the data in the first place, can you simply 
use the header as a header instead of a row?

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)

See the header option
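
To make the effect of the header option concrete, here is a minimal plain-Java 
sketch (no Spark dependency) of what a header-aware CSV reader does 
conceptually. With header set to true, Spark's CSV reader treats the first line 
of each file as a header and does not return it as a data row, so a Dataset<Row> 
read this way should contain no header row to remove. The class HeaderAwareCsv 
and its read method are hypothetical names used only for illustration, not Spark 
API, and the sketch assumes simple CSV lines without quoted fields.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Plain-Java illustration only: line 0 supplies column names, the rest are data. */
final class HeaderAwareCsv {
    final List<String> columns;     // taken from the header line
    final List<List<String>> rows;  // data rows only; the header is never among them

    private HeaderAwareCsv(List<String> columns, List<List<String>> rows) {
        this.columns = columns;
        this.rows = rows;
    }

    /** Hypothetical helper: splits simple, unquoted CSV lines on the delimiter. */
    static HeaderAwareCsv read(List<String> lines, String delimiter) {
        List<String> columns = new ArrayList<>();
        List<List<String>> rows = new ArrayList<>();
        String regex = Pattern.quote(delimiter);  // treat the delimiter literally
        for (int i = 0; i < lines.size(); i++) {
            List<String> fields = Arrays.asList(lines.get(i).split(regex, -1));
            if (i == 0) {
                columns.addAll(fields);  // the header line is consumed here
            } else {
                rows.add(fields);        // every other line becomes a data row
            }
        }
        return new HeaderAwareCsv(columns, rows);
    }

    public static void main(String[] args) {
        HeaderAwareCsv csv = read(Arrays.asList("category,rank", "books,1", "music,2"), ",");
        System.out.println("columns = " + csv.columns);
        System.out.println("rows    = " + csv.rows);
    }
}
```

In Spark itself, the column names remain available through the schema of the 
Dataset<Row> (for example via columns()), not through the rows.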

On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:
Hi All,

I would like to apply a regression to my data. One of the workflow steps is to 
prepare my data as a JavaRDD<LabeledPoint>, starting from a Dataset<Row> with 
its header. So, what I did was the following:

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>
        JavaRDD<Row> dataPointsWithHeader = modelDS.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)
        Row header = dataPointsWithHeader.first();

== Step 3: eliminate the header row by
        JavaRDD<Row> dataPointsWithoutHeader =
                dataPointsWithHeader.filter((Row row) -> !row.equals(header));

The issues with the above approach are:

a) the result of Step 2 is not the header row;
b) applying Step 3 compares every row against the header, which is very 
inefficient if there is a way to access the header directly.
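
As an alternative to the equality filter in Step 3, a header can be dropped by 
position instead of by comparing every row. The plain-Java sketch below 
simulates partitions as lists and removes only the first element of partition 0. 
The class HeaderSkip and its dropHeader method are hypothetical names; in Spark 
the analogous operation would be JavaRDD.mapPartitionsWithIndex. This is a 
sketch under a strong assumption: the header really is the first element of the 
first partition, which holds only for raw lines read from a single file and 
before any shuffle such as a join.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration only: partitions are simulated as a list of lists. */
final class HeaderSkip {

    /** Drops the first element of partition 0; all other partitions are copied as-is. */
    static List<List<String>> dropHeader(List<List<String>> partitions) {
        List<List<String>> result = new ArrayList<>();
        for (int index = 0; index < partitions.size(); index++) {
            List<String> partition = partitions.get(index);
            if (index == 0 && !partition.isEmpty()) {
                // Only the first partition can start with the header line.
                result.add(new ArrayList<>(partition.subList(1, partition.size())));
            } else {
                result.add(new ArrayList<>(partition));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("category,rank", "books,1"),
                Arrays.asList("music,2", "films,3"));
        System.out.println(dropHeader(partitions));
    }
}
```

Unlike the filter, this touches no row beyond the first one, but it gives wrong 
results if the row order has changed or if the header was already consumed when 
the data was read.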

My question is:

Is there an efficient way to access the header and eliminate it?

Many thanks in advance for your help and suggestions.

Regards,
Carlo
