Thanks, Eric. Yes, position can be null: most of my fields are optional.
So it seems that the problem comes from Parquet.
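
In the meantime, a possible workaround is to name only the columns the query actually needs, since the narrower select finished in minutes. Below is a minimal sketch for the spark-shell, assuming Spark 1.0.x (where `sc` is predefined). The extra columns `readName` and `readBase` are only an illustrative choice, and I have not measured how much each additional column costs.

```scala
// Sketch for the Spark 1.0.x spark-shell, where `sc` (the SparkContext) already exists.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._  // brings sql(...) and parquetFile(...) into scope

// Path taken from the query plan quoted below.
val pileup = parquetFile("hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup")
pileup.registerAsTable("MyTable")

// List the needed columns explicitly instead of "*", so the scan
// only has to read those columns from the Parquet file.
val rows = sql("select position, readName, readBase from MyTable where position = 243189160").collect()
rows.foreach(println)
```

On Spark 1.1 and later, `registerAsTable` was renamed to `registerTempTable`, so the sketch would need that one change there.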


On Sat, Jul 19, 2014 at 8:27 AM, Eric Friedman <eric.d.fried...@gmail.com>
wrote:

> Can position be null?  Looks like there may be constraints with predicate
> push down in that case. https://github.com/apache/spark/pull/511/
>
> On Jul 18, 2014, at 8:04 PM, Christos Kozanitis <kozani...@berkeley.edu>
> wrote:
>
> Hello
>
> In what order does SparkSQL deserialize Parquet fields? Is it
> possible to change that order?
>
> I am using SparkSQL to query a parquet file that consists of a lot of
> fields (around 30 or so). Let me call an example table MyTable and let's
> suppose the name of one of its fields is "position".
>
> The query that I am executing is:
> sql("select * from MyTable where position = 243189160")
>
> The query plan that I get from this query is:
> Filter (position#6L:6 = 243189160)
>  ParquetTableScan
> [contig.contigName#0,contig.contigLength#1L,contig.contigMD5#2,contig.referenceURL#3,contig.assembly#4,contig.species#5,position#6L,rangeOffset#7,rangeLength#8,referenceBase#9,readBase#10,sangerQuality#11,mapQuality#12,numSoftClipped#13,numReverseStrand#14,countAtPosition#15,readName#16,readStart#17L,readEnd#18L,recordGroupSequencingCenter#19,recordGroupDescription#20,recordGroupRunDateEpoch#21L,recordGroupFlowOrder#22,recordGroupKeySequence#23,recordGroupLibrary#24,recordGroupPredictedMedianInsertSize#25,recordGroupPlatform#26,recordGroupPlatformUnit#27,recordGroupSample#28],
> (ParquetRelation hdfs://ec2-54-89-87-167.compute-1.amazonaws.com:9000/genomes/hg00096.plup), None
>
> I expect 14 entries in the output but the execution of
> .collect.foreach(println) takes forever to run on my cluster (more than an
> hour).
>
> Is it safe to assume in my example that SparkSQL deserializes all fields
> first before applying the filter? If so, can a user change this behavior?
>
> To support my assumption I replaced "*" with "position", so my new query
> is:
> sql("select position from MyTable where position = 243189160")
>
> This query runs much faster on the same hardware (2-3 minutes vs. 65
> minutes).
>
> Any ideas?
>
> thanks
> Christos
>
>
