Hello

In what order does SparkSQL deserialize parquet fields? Is it possible to
modify that order?

I am using SparkSQL to query a parquet file with a large number of fields
(around 30). Call the example table MyTable, and suppose one of its fields
is named "position".

The query that I am executing is:
sql("select * from MyTable where position = 243189160")

The query plan that I get from this query is:
Filter (position#6L:6 = 243189160)
 ParquetTableScan
[contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
(ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
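
For reference, my setup looks roughly like the sketch below (run from the
Spark shell; the table registration call is paraphrased and the variable
names are just placeholders, so the exact method names may differ between
Spark versions):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)  // sc: the SparkContext provided by the shell
  import sqlContext._                  // brings sql(...) and parquetFile(...) into scope

  // Load the parquet data and register it as a queryable table.
  val pileups = parquetFile(
    "hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup")
  pileups.registerTempTable("MyTable") // registerAsTable on older releases

  // The slow query: all ~30 columns appear in the ParquetTableScan above.
  val rows = sql("select * from MyTable where position = 243189160")
  rows.collect.foreach(println)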

I expect 14 rows in the output, but running .collect.foreach(println) on the
result takes forever on my cluster (more than an hour).

Is it safe to assume that, in my example, SparkSQL deserializes all fields
before applying the filter? If so, can a user change this behavior?

To support my assumption, I replaced "*" with "position", so the new query is
sql("select position from MyTable where position = 243189160"). This query
runs much faster on the same hardware (2-3 minutes vs. 65 minutes).
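
For comparison, the faster variant is the same call with only the projection
changed (variable name again just a placeholder):

  // Projecting only the filtered column: 2-3 minutes instead of ~65.
  val positions = sql("select position from MyTable where position = 243189160")
  positions.collect.foreach(println)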

Any ideas?

thanks
Christos
