I just noticed you found that 1.4 has the same issue. I added that to the ticket as well.
On Mon, Sep 21, 2015 at 1:43 PM, Jerry Lam <chiling...@gmail.com> wrote:
> Hi Yin,
>
> You are right! I just tried the Scala version with the above lines; it
> works as expected. I'm not sure if it also happens in 1.4 for pyspark, but
> I thought the pyspark code just calls the Scala code via py4j. I didn't
> expect this bug to be pyspark-specific; that actually surprises me a bit.
> I created a ticket for this (SPARK-10731
> <https://issues.apache.org/jira/browse/SPARK-10731>).
>
> Best Regards,
>
> Jerry
>
> On Mon, Sep 21, 2015 at 1:01 PM, Yin Huai <yh...@databricks.com> wrote:
>> btw, does 1.4 have the same problem?
>>
>> On Mon, Sep 21, 2015 at 10:01 AM, Yin Huai <yh...@databricks.com> wrote:
>>> Hi Jerry,
>>>
>>> Looks like it is a Python-specific issue. Can you create a JIRA?
>>>
>>> Thanks,
>>>
>>> Yin
>>>
>>> On Mon, Sep 21, 2015 at 8:56 AM, Jerry Lam <chiling...@gmail.com> wrote:
>>>> Hi Spark Developers,
>>>>
>>>> I just ran some very simple operations on a dataset and was surprised
>>>> by the execution plan of take(1), head(), or first().
>>>>
>>>> For your reference, this is what I did in pyspark 1.5:
>>>> df = sqlContext.read.parquet("someparquetfiles")
>>>> df.head()
>>>>
>>>> The above lines take over 15 minutes. I was frustrated because I can
>>>> do better without using Spark :) Since I like Spark, I tried to figure
>>>> out why. It seems the DataFrame requires 3 stages to give me the first
>>>> row: it reads all the data (about 1 billion rows) and runs Limit twice.
>>>>
>>>> Instead of head(), show(1) runs much faster. Not to mention that if I
>>>> do:
>>>>
>>>> df.rdd.take(1)  # runs much faster
>>>>
>>>> Is this expected? Why are head/first/take so slow for DataFrames? Is
>>>> it a bug in the optimizer, or did I do something wrong?
>>>>
>>>> Best Regards,
>>>>
>>>> Jerry