Thank you very much.
On Mon, Jun 20, 2016 at 3:38 PM, Jörn Franke wrote:
> If you insert the data sorted then there is no need to bucket the data.
> You can even create an index in Spark. Simply set the output format
> configuration orc.create.index = true
>
>
> On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote:
Thank you very much.
On Mon, Jun 20, 2016 at 3:10 PM, Mich Talebzadeh
wrote:
> Right, your concern is that you expect the storeindex in the ORC file to help
> the optimizer.
>
> Frankly I do not know what
> write().mode(SaveMode.Overwrite).orc("orcFileToRead") does actually under
> the bonnet. From my experience, in order for the ORC index to be used you
> need to bucket the table.
If you insert the data sorted then there is no need to bucket the data.
You can even create an index in Spark. Simply set the output format
configuration orc.create.index = true
> On 20 Jun 2016, at 09:10, Mich Talebzadeh wrote:
>
> Right, your concern is that you expect the storeindex in the ORC file to help
> the optimizer.
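As a sketch of the suggestion above: sort on the join key before writing, so stripe min/max ranges stay tight, and pass the orc.create.index setting through to the writer. This assumes Spark 1.x with a HiveContext; whether DataFrameWriter options reach the ORC OutputFormat this way should be verified, and the df and path names are placeholders.

```scala
// Hypothetical sketch, not verified against a cluster.
// df is a placeholder DataFrame; the path is a placeholder.
import org.apache.spark.sql.SaveMode

val sorted = df.sort("join_key")            // sort on the join key first
sorted.write
  .mode(SaveMode.Overwrite)
  .option("orc.create.index", "true")       // setting named in the advice above
  .orc("/path/to/orcFileToRead")
```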
Right, your concern is that you expect the storeindex in the ORC file to help
the optimizer.
Frankly I do not know what
write().mode(SaveMode.Overwrite).orc("orcFileToRead") does actually under
the bonnet. From my experience, in order for the ORC index to be used you need
to bucket the table. I have explained th
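For reference, bucketing as described above is declared on the Hive table itself. A hedged sketch, assuming a HiveContext and using placeholder table, column, and bucket-count values:

```scala
// Hypothetical DDL sketch: bucket and sort the lookup table on the join key.
// Table/column names and the bucket count (256) are placeholders.
hiveContext.sql(
  """CREATE TABLE lookup_orc (key BIGINT, value STRING)
    |CLUSTERED BY (key) SORTED BY (key) INTO 256 BUCKETS
    |STORED AS ORC TBLPROPERTIES ("orc.create.index"="true")""".stripMargin)
hiveContext.sql(
  "INSERT OVERWRITE TABLE lookup_orc SELECT key, value FROM staging")
```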
Hi Mich,
Thank you for your reply.
Let me explain more clearly.
A file with 100 records needs to be joined with a big lookup file created in ORC
format (500 million records). The Spark process I wrote is returning the
matching records and is working fine. My concern is that it loads the
entire file.
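The worry about scanning the whole 500-million-row file is exactly what ORC stripe statistics address: when the data is written sorted, a reader can skip every stripe whose min/max key range cannot contain the lookup keys. A toy simulation of that idea follows; it is plain Scala, not Spark or real ORC, and all names and values are purely illustrative.

```scala
// Conceptual simulation of stripe-level min/max skipping in ORC.
// Real ORC readers consult per-stripe (and per-row-group) statistics
// stored in the file; this toy model just mimics the skipping logic.
object StripeSkipDemo {
  case class Stripe(min: Long, max: Long, rows: Seq[Long])

  // Only scan stripes whose [min, max] range can contain the key.
  // Returns the matching rows and how many stripes were actually read.
  def lookup(stripes: Seq[Stripe], key: Long): (Seq[Long], Int) = {
    val candidates = stripes.filter(s => key >= s.min && key <= s.max)
    (candidates.flatMap(_.rows.filter(_ == key)), candidates.size)
  }

  def main(args: Array[String]): Unit = {
    // Data written sorted: each stripe covers a disjoint key range.
    val stripes = Seq(
      Stripe(1L, 100L, 1L to 100L),
      Stripe(101L, 200L, 101L to 200L),
      Stripe(201L, 300L, 201L to 300L)
    )
    val (hits, stripesRead) = lookup(stripes, 150L)
    println(s"hits=${hits.mkString(",")} stripesRead=$stripesRead")
    // prints: hits=150 stripesRead=1  (two of three stripes skipped)
  }
}
```

With unsorted data the key ranges of the stripes overlap, every stripe becomes a candidate, and the whole file is read — which is why sorting (or bucketing plus sorting) matters for the index to pay off.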
Hi,
To start when you store the data in ORC file can you verify that the data
is there?
For example, register it as a temp table:
processDF.registerTempTable("tmp")
sql("select count(1) from tmp").show
Also what do you mean by index file in ORC?
HTH
Dr Mich Talebzadeh
I am trying to join a DataFrame (say 100 records) with an ORC file with 500
million records through Spark (this can increase to 4-5 billion records, 25
bytes each record).
I used Spark hiveContext API.
*ORC File Creation Code*
//fsdtRdd is JavaRDD, fsdtSchema is StructType schema
DataFrame fsdtDf = hiveContext
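The creation code is cut off above; the usual Spark 1.x pattern it appears to follow is sketched below, in Scala rather than the original Java. fsdtRdd and fsdtSchema stand in for the poster's variables, and the output path is a placeholder.

```scala
// Sketch of the likely continuation, assuming the Spark 1.x hiveContext API.
// fsdtRdd (an RDD of Row) and fsdtSchema (a StructType) come from the post.
import org.apache.spark.sql.SaveMode

val fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema)
fsdtDf.write
  .mode(SaveMode.Overwrite)
  .format("orc")
  .save("/path/to/orcFileToRead")   // placeholder path
```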