If you insert the data sorted then there is no need to bucket the data. 
You can even create an index in Spark. Simply set the output format 
configuration orc.create.index = true
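A minimal sketch of that approach (Scala, Spark 1.x with HiveContext; the 
DataFrame name df, the id column and the output path are illustrative, and 
whether the DataFrame writer honours orc.create.index can depend on your 
Spark version):

import org.apache.spark.sql.SaveMode

// ask the ORC writer to emit row-group indexes
hiveContext.setConf("orc.create.index", "true")

// sorting on the join key gives each stripe a tight min/max range,
// which is what makes the index useful for skipping data
df.sort("id")
  .write
  .mode(SaveMode.Overwrite)
  .orc("orcFileSorted")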


> On 20 Jun 2016, at 09:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> Right, your concern is that you expect the storage index in the ORC file to 
> help the optimizer.
> 
> Frankly I do not know what 
> write().mode(SaveMode.Overwrite).orc("orcFileToRead") actually does under 
> the bonnet. From my experience, in order for the ORC index to be used you 
> need to bucket the table. I have explained this here before.
> 
> Now it is also possible that you have not updated the statistics on the table.
> 
> Even with Spark I tend to create my ORC table explicitly through Spark SQL.
> 
> You stated that the join scans the entire underlying ORC table. Your "id" 
> column, I assume, is unique, so I would bucket the table on the id column.
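> 
> Something along these lines (Scala via HiveContext; the table name, column 
> names and bucket count are illustrative, and the DDL goes through the Hive 
> dialect, so check it against your Hive version):
> 
> // older Hive versions need this before inserting into a bucketed table
> hiveContext.sql("SET hive.enforce.bucketing = true")
> 
> hiveContext.sql("""CREATE TABLE lookup_orc (id STRING, payload STRING)
>   CLUSTERED BY (id) INTO 256 BUCKETS
>   STORED AS ORC
>   TBLPROPERTIES ("orc.create.index" = "true")""")
> 
> hiveContext.sql("INSERT OVERWRITE TABLE lookup_orc SELECT id, payload FROM staging")
> 
> // refresh the statistics the optimizer relies on
> hiveContext.sql("ANALYZE TABLE lookup_orc COMPUTE STATISTICS")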
> 
> 
> HTH
> 
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 20 June 2016 at 07:07, Mohanraj Ragupathiraj <mohanaug...@gmail.com> 
>> wrote:
>> Hi Mich,
>> 
>> Thank you for your reply. 
>> 
>> Let me explain more clearly.
>> 
>> A file with 100 records needs to be joined with a big lookup file created in 
>> ORC format (500 million records). The Spark process I wrote returns the 
>> matching records and works fine. My concern is that it loads the entire file 
>> (500 million records) and matches it against the 100 records instead of 
>> loading only the stripes with matching keys. I read that the ORC format 
>> provides indexes (https://orc.apache.org/docs/indexes.html) and I assumed 
>> that when I join using DataFrames, the indexes would be used, so that only 
>> the matching records/stripes would be loaded for processing instead of the 
>> whole table.
>> 
>>> On Mon, Jun 20, 2016 at 1:00 PM, Mich Talebzadeh 
>>> <mich.talebza...@gmail.com> wrote:
>>> Hi,
>>> 
>>> To start with, when you store the data in the ORC file, can you verify that 
>>> the data is there?
>>> 
>>> For example, register it as a temp table:
>>> 
>>> processDF.registerTempTable("tmp")
>>> sql("select count(1) from tmp").show
>>> 
>>> Also, what do you mean by an index file in ORC?
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> http://talebzadehmich.wordpress.com
>>>  
>>> 
>>>> On 20 June 2016 at 05:01, Mohanraj Ragupathiraj <mohanaug...@gmail.com> 
>>>> wrote:
>>>> I am trying to join a DataFrame (say 100 records) with an ORC file of 500 
>>>> million records through Spark (this can increase to 4-5 billion records, 25 
>>>> bytes each).
>>>> 
>>>> I used the Spark HiveContext API.
>>>> 
>>>> ORC File Creation Code
>>>> 
>>>> //fsdtRdd is JavaRDD, fsdtSchema is StructType schema
>>>> DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
>>>> fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
>>>> ORC File Reading Code
>>>> 
>>>> HiveContext hiveContext = new HiveContext(sparkContext);
>>>> DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
>>>> // allRecords is dataframe
>>>> // "left_outer_join" is not a valid Spark join type; use "left_outer"
>>>> DataFrame processDf = allRecords.join(orcFileData,
>>>>     allRecords.col("id").equalTo(orcFileData.col("id")), "left_outer");
>>>> processDf.show();
>>>> When I read the ORC file, I get the following in my Spark logs:
>>>> 
>>>> Input split: 
>>>> file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348
>>>> min key = null, max key = null
>>>> Reading ORC rows from 
>>>> file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc
>>>>  with {include: [true, true, true], offset: 0, length: 9223372036854775807}
>>>> Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
>>>> Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 
>>>> 56,PROCESS_LOCAL, 2220 bytes)
>>>> Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
>>>> Running task 56.0 in stage 2.0 (TID 60)
>>>> Although the Spark job completes successfully, I think it is not able to 
>>>> utilize the ORC index capability and thus scans the entire block of ORC 
>>>> data before moving on (note the "min key = null, max key = null" above).
>>>> 
>>>> Question
>>>> 
>>>> -- Is this normal behaviour, or do I have to set some configuration before 
>>>> saving the data in ORC format?
>>>> 
>>>> -- If it is NORMAL, what is the best way to join so that non-matching 
>>>> records are discarded at the disk level (ideally so that only the index 
>>>> portion of the ORC data is loaded)?
>>>> 
>> 
>> 
>> 
>> -- 
>> Thanks and Regards
>> Mohan
>> VISA Pte Limited, Singapore.
> 
