Thanks, Eric. Yes, position can be null, since most of my fields are optional. So it seems that the problem comes from Parquet.
On Sat, Jul 19, 2014 at 8:27 AM, Eric Friedman <eric.d.fried...@gmail.com> wrote:

> Can position be null? Looks like there may be constraints with predicate
> push down in that case. https://github.com/apache/spark/pull/511/
>
> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis <kozani...@berkeley.edu>
> wrote:
>
> Hello
>
> What is the order with which SparkSQL deserializes parquet fields? Is it
> possible to modify it?
>
> I am using SparkSQL to query a parquet file that consists of a lot of
> fields (around 30 or so). Let me call an example table MyTable and let's
> suppose the name of one of its fields is "position".
>
> The query that I am executing is:
> sql("select * from MyTable where position = 243189160")
>
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
> ParquetTableScan
> [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
> (ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
>
> I expect 14 entries in the output but the execution of
> .collect.foreach(println) takes forever to run on my cluster (more than an
> hour).
>
> Is it safe to assume in my example that SparkSQL deserializes all fields
> first before applying the filter? If so, can a user change this behavior?
>
> To support my assumption I replaced "*" with "position", so my new query
> is of the form sql("select position from MyTable where position =
> 243189160") and this query runs much faster on the same hardware (2-3
> minutes vs 65 min).
>
> Any ideas?
>
> thanks
> Christos