Thanks Mich. Yes, I know both headers (categoryRankSchema, categorySchema), as expressed below:

    this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);
    this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath);

> Can you use filter to get rid of the header from both CSV files before joining them?

Well, I can give it a try. But my case is a bit more complex than joining two datasets. It joins data from at least six datasets, which are processed separately (to clean them and extract the needed info), and only at the end do I join three datasets for which I know the headers.

I do believe that there should be another way to achieve my goal. Any other suggestion would be much appreciated.

Many thanks.

Best regards,
Carlo

On 3 Aug 2016, at 18:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

Do you know the headers?

Can you use filter to get rid of the header from both CSV files before joining them?

Dr Mich Talebzadeh

On 3 August 2016 at 18:32, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:

Hi Aseem,

Thank you very much for your help. Please allow me to be more specific about my case (to some extent I already do what you suggested).

Let us imagine that I have two CSV datasets, d1 and d2.
I generate the Dataset<Row> as follows:

== Reading d1:

    sparkSession = spark;
    options = new HashMap();
    options.put("header", "true");
    options.put("delimiter", delimiter);
    options.put("nullValue", nullValue);
    DataFrameReader d1_DFR = spark.read().options(options);
    this.dataset1 = d1_DFR.schema(categoryRankSchema).csv(categoryrankFilePath);

== Reading d2:

    sparkSession = spark;
    options = new HashMap();
    options.put("header", "true");
    options.put("delimiter", delimiter);
    options.put("nullValue", nullValue);
    DataFrameReader d2_DFR = spark.read().options(options);
    this.dataset2 = d2_DFR.schema(categorySchema).csv(categoryFilePath);

So far, I have the header option set to true. Now, let us imagine that we need to join the two datasets:

    // "some condition" is a placeholder for the actual join condition
    Dataset<Row> dataset1_Join_dataset2 = dataset1.join(dataset2, "some condition");

All of the processing below (Step 1, Step 2 and Step 3) starts from dataset1_Join_dataset2. In particular, I realised that after these steps:

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>

    JavaRDD<Row> dataPointsWithHeader = dataset1_Join_dataset2.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)

    Row header = dataPointsWithHeader.first();

the header is not what first() returns.

So my question still is: is there an efficient way to access the header and eliminate it?

Many thanks in advance for your support.

Best regards,
Carlo

On 3 Aug 2016, at 18:13, Aseem Bansal <asmbans...@gmail.com> wrote:

Hi,

Depending on how you are reading the data in the first place, can you simply use the header as a header instead of as a row?

http://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameReader.html#csv(scala.collection.Seq)

See the header option.

On Wed, Aug 3, 2016 at 10:14 PM, Carlo.Allocca <carlo.allo...@open.ac.uk> wrote:

Hi All,

I would like to apply a regression to my data.
One step of the workflow is to prepare my data as a JavaRDD<LabeledPoint>, starting from a Dataset<Row> with its header. So, what I did was the following:

== Step 1: transform the Dataset<Row> into a JavaRDD<Row>

    JavaRDD<Row> dataPointsWithHeader = modelDS.toJavaRDD();

== Step 2: take the first row (I was thinking that it was the header)

    Row header = dataPointsWithHeader.first();

== Step 3: eliminate the header row with a filter

    JavaRDD<Row> dataPointsWithoutHeader =
        dataPointsWithHeader.filter((Row row) -> !row.equals(header));

The issues with the above approach are:

a) the result of Step 2 is not the header row;
b) applying Step 3 is very inefficient if there is a way to access the header directly.

My question is: is there an efficient way to access the header and eliminate it?

Many thanks in advance for your help and suggestions.

Regards,
Carlo
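
For illustration only, here is a minimal plain-Java sketch (not Spark code) of issues a) and b) from the original question, assuming the rows are simply a list of strings. The class and method names (HeaderSkipSketch, dropByComparison, dropAtReadTime) are hypothetical. dropByComparison mirrors Step 2 and Step 3. dropAtReadTime drops the header once, by position, which is roughly the idea behind letting the reader treat the header as a header rather than as a row.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration class; it does not use or depend on Spark.
final class HeaderSkipSketch {

    // Mirrors Step 2 + Step 3: assume the first element is the header,
    // then compare every remaining row against it.
    static List<String> dropByComparison(List<String> lines) {
        if (lines.isEmpty()) {
            return lines;
        }
        String header = lines.get(0);
        return lines.stream()
                .filter(line -> !line.equals(header))
                .collect(Collectors.toList());
    }

    // Drops the header once, by position, while the file order is still known.
    static List<String> dropAtReadTime(List<String> lines) {
        return lines.stream().skip(1).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> inFileOrder = Arrays.asList("category,rank", "books,1", "music,2");
        // Same rows, but the header is no longer first (as can happen once
        // rows have been reordered, for example by a join).
        List<String> reordered = Arrays.asList("books,1", "category,rank", "music,2");

        System.out.println(dropAtReadTime(inFileOrder));   // [books,1, music,2]
        System.out.println(dropByComparison(inFileOrder)); // [books,1, music,2]
        System.out.println(dropByComparison(reordered));   // [category,rank, music,2]
    }
}
```

With the header in first position both methods agree. Once the order changes, dropByComparison removes a data row and keeps the header, which matches issue a); it also compares every row against the header, which is the cost described in issue b).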