Hi Mich,

Thank you for your reply.

Let me explain more clearly.

A file with 100 records needs to be joined with a large lookup file stored in ORC
format (500 million records). The Spark process I wrote returns the
matching records and works fine. My concern is that it loads the
entire file (500 million records) and matches it against the 100 records, instead of
loading only the stripes with matching keys. I read that the ORC format provides
indexes (https://orc.apache.org/docs/indexes.html), and I assumed that when
I join using DataFrames the indexes would be used, so that only the
matching records/stripes are loaded for processing instead of the whole table.
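
For reference, the behaviour I was hoping the ORC indexes would give me
implicitly looks roughly like the sketch below. It assumes the join key column
is "id", Spark 1.6-style Java APIs, and that spark.sql.orc.filterPushdown
(off by default here) is the setting that lets the reader consult stripe and
row-group statistics; please correct me if any of that is wrong.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// Enable ORC predicate pushdown so stripe/row-group statistics can be consulted
hiveContext.setConf("spark.sql.orc.filterPushdown", "true");

DataFrame orcFileData = hiveContext.read().orc("orcFileToRead");

// Collect the 100 keys from the small side and push them down as an IN filter,
// hoping ORC can skip stripes whose min/max statistics exclude these keys
List<Object> keys = new ArrayList<Object>();
for (Row row : allRecords.select("id").collect()) { // allRecords is the small DataFrame
    keys.add(row.get(0));
}
DataFrame filtered = orcFileData.filter(orcFileData.col("id").isin(keys.toArray()));

DataFrame processDf = allRecords.join(filtered,
        allRecords.col("id").equalTo(filtered.col("id")), "left_outer");
processDf.show();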

On Mon, Jun 20, 2016 at 1:00 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> To start when you store the data in ORC file can you verify that the data
> is there?
>
> For example register it as tempTable
>
> processDF.register("tmp")
> sql("select count(1) from tmp).show
>
> Also what do you mean by index file in ORC?
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 20 June 2016 at 05:01, Mohanraj Ragupathiraj <mohanaug...@gmail.com>
> wrote:
>
>> I am trying to join a DataFrame (say 100 records) with an ORC file of
>> 500 million records through Spark (this can grow to 4-5 billion records,
>> about 25 bytes each).
>>
>> I used the Spark HiveContext API.
>>
>> *ORC File Creation Code*
>>
>> //fsdtRdd is JavaRDD, fsdtSchema is StructType schema
>> DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd,fsdtSchema);
>> fsdtDf.write().mode(SaveMode.Overwrite).orc("orcFileToRead");
>>
>> *ORC File Reading Code*
>>
>> HiveContext hiveContext = new HiveContext(sparkContext);
>> DataFrame orcFileData= hiveContext.read().orc("orcFileToRead");
>> // allRecords is dataframe
>> DataFrame processDf = 
>> allRecords.join(orcFileData, allRecords.col("id").equalTo(orcFileData.col("id")), "left_outer");
>> processDf.show();
>>
>> When I read the ORC file, I get the following in my Spark logs:
>>
>> Input split: 
>> file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc:0+3163348 min key = null, max key = null
>> Reading ORC rows from 
>> file:/C:/AOD_PID/PVP.vincir_frst_seen_tran_dt_ORC/part-r-00024-b708c946-0d49-4073-9cd1-5cc46bd5972b.orc
>>  with {include: [true, true, true], offset: 0, length: 9223372036854775807}
>> Finished task 55.0 in stage 2.0 (TID 59). 2455 bytes result sent to driver
>> Starting task 56.0 in stage 2.0 (TID 60, localhost, partition 
>> 56,PROCESS_LOCAL, 2220 bytes)
>> Finished task 55.0 in stage 2.0 (TID 59) in 5846 ms on localhost (56/84)
>> Running task 56.0 in stage 2.0 (TID 60)
>>
>> Although the Spark job completes successfully, I think it is not able to
>> utilize the ORC index capability and therefore scans the entire block of
>> ORC data before moving on.
>>
>> *Question*
>>
>> -- Is this normal behaviour, or do I have to set any configuration before
>> saving the data in ORC format?
>>
>> -- If it is NORMAL, what is the best way to join so that non-matching
>> records are discarded at the disk level (ideally only the ORC index data
>> is loaded)?
>>
>
>
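
One more thing I plan to try, on the assumption (please correct me if this is
wrong) that stripe and row-group min/max statistics only help with skipping
when the data is clustered by the join key, is sorting the lookup data by "id"
before writing the ORC file:

// Sort by the join key before the write, so that each stripe covers a
// narrow range of ids and pushed-down predicates can skip most stripes
// (assumption: the ORC writer records min/max column statistics per stripe)
DataFrame fsdtDf = hiveContext.createDataFrame(fsdtRdd, fsdtSchema);
fsdtDf.sort("id").write().mode(SaveMode.Overwrite).orc("orcFileToRead");

Does that sound like the right direction?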


-- 
Thanks and Regards
Mohan
VISA Pte Limited, Singapore.
