Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mohanraj Ragupathiraj
Thank you very much. On Mon, Jun 20, 2016 at 3:38 PM, Jörn Franke wrote: > If you insert the data sorted then there is no need to bucket the data. > You can even create an index in Spark. Simply set the output format > configuration orc.create.index = true > > > On 20 Jun 2016, at 09:10, Mich Ta

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mohanraj Ragupathiraj
Thank you very much. On Mon, Jun 20, 2016 at 3:10 PM, Mich Talebzadeh wrote: > Right, your concern is that you expect the store index in the ORC file to help the > optimizer. > > Frankly I do not know what > write().mode(SaveMode.Overwrite).orc("orcFileToRead") actually does under > the bonnet. From my exp

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Jörn Franke
If you insert the data sorted then there is no need to bucket the data. You can even create an index in Spark. Simply set the output format configuration orc.create.index = true > On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote: > > Right, your concern is that you expect the store index in ORC fil
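[Editor's note: a minimal Scala sketch of what Jörn suggests, assuming Spark 1.6 with a HiveContext. The option name comes from this thread; how it is plumbed through to the ORC writer may differ by version, and the path is illustrative.]

    import org.apache.spark.sql.SaveMode

    // orc.create.index is an ORC output-format property (per the thread);
    // one place to set it is the Hadoop configuration before writing.
    sc.hadoopConfiguration.set("orc.create.index", "true")

    df.sort("key")                        // insert the data sorted, as suggested
      .write
      .mode(SaveMode.Overwrite)
      .orc("/path/to/orcFileToRead")      // illustrative path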

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-20 Thread Mich Talebzadeh
Right, your concern is that you expect the store index in the ORC file to help the optimizer. Frankly I do not know what write().mode(SaveMode.Overwrite).orc("orcFileToRead") actually does under the bonnet. From my experience, in order for the ORC index to be used you need to bucket the table. I have explained th
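[Editor's note: a hedged sketch of the bucketing Mich refers to, assuming a Hive-managed ORC table queried through HiveContext; the table name, column names and bucket count are made up for illustration.]

    // Create an ORC table bucketed and sorted on the join key so the
    // ORC index can be exploited, then populate it with bucketing enforced.
    hiveContext.setConf("hive.enforce.bucketing", "true")
    hiveContext.sql("""
      CREATE TABLE lookup_orc (key STRING, value STRING)
      CLUSTERED BY (key) SORTED BY (key) INTO 256 BUCKETS
      STORED AS ORC TBLPROPERTIES ("orc.create.index"="true")
    """)
    hiveContext.sql("INSERT OVERWRITE TABLE lookup_orc SELECT key, value FROM tmp")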

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-19 Thread Mohanraj Ragupathiraj
Hi Mich, Thank you for your reply. Let me explain more clearly. A file with 100 records needs to be joined with a big lookup file created in ORC format (500 million records). The Spark process I wrote returns the matching records and is working fine. My concern is that it loads the entire fi
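[Editor's note: a sketch of one way to avoid scanning the whole lookup file, assuming Spark 1.6; DataFrame names are illustrative. Broadcasting the 100-row side avoids shuffling the big side, and collecting the small keys into an explicit IN filter is one way to let ORC filter pushdown skip stripes via the min/max statistics mentioned in the subject line.]

    import org.apache.spark.sql.functions.broadcast

    hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
    val big    = hiveContext.read.orc("/path/to/bigLookup")      // 500M-row ORC file
    val keys   = small.select("key").collect().map(_.getString(0)) // small = 100-row DataFrame
    val joined = big.filter(big("key").isin(keys: _*))           // candidate for pushdown
                    .join(broadcast(small), Seq("key"))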

Re: Spark - “min key = null, max key = null” while reading ORC file

2016-06-19 Thread Mich Talebzadeh
Hi, To start, when you store the data in the ORC file can you verify that the data is there? For example, register it as a temp table: processDF.registerTempTable("tmp") sql("select count(1) from tmp").show Also, what do you mean by an index file in ORC? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedi

Spark - “min key = null, max key = null” while reading ORC file

2016-06-19 Thread Mohanraj Ragupathiraj
I am trying to join a DataFrame (say 100 records) with an ORC file with 500 million records through Spark (can increase to 4-5 billion, 25 bytes each record). I used the Spark HiveContext API. *ORC File Creation Code* //fsdtRdd is a JavaRDD, fsdtSchema is a StructType schema DataFrame fsdtDf = hiveContext
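[Editor's note: a hedged reconstruction of the flow described above, in Scala rather than the original Java; the path, column name and filter value are assumptions for illustration.]

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.functions.col

    // Build the DataFrame from the RDD of Rows and its StructType schema,
    // then persist it as ORC (mirrors the truncated snippet above).
    val fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema)
    fsdtDf.write.mode(SaveMode.Overwrite).orc("/path/to/orcFileToRead")

    // Read it back with a key filter; with spark.sql.orc.filterPushdown=true
    // the reader can consult stripe min/max statistics (the "min key / max key"
    // in the subject line) to skip irrelevant data.
    val hit = hiveContext.read.orc("/path/to/orcFileToRead")
                         .filter(col("key") === "someKey")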