Arun,

On Thu, Sep 18, 2014 at 9:52 AM, Arun Luthra <arun.lut...@gmail.com> wrote:
> I'm doing a Spark SQL benchmark similar to the code in
> https://spark.apache.org/docs/latest/sql-programming-guide.html
> (section: Inferring the Schema Using Reflection). What's the simplest way
> to time the SQL statement itself, so that I'm not timing
> the .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)) part of
> the RDD creation? I'm using a few calls to System.nanoTime() for timing.

To isolate that part, I think you should call persist() after your
split(...) etc. operations and then access the data with an output
operation (such as count()) so that the computation is actually executed.
After that, if you use the same RDD in your SQL statement, it will use
the persisted data and you should be able to measure more or less just
the time needed for the SQL processing.

OK?

Tobias
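P.S. A rough sketch of what I mean, untested (the file path and the
"teenagers" query are just taken from the programming guide example,
so substitute your own):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  case class Person(name: String, age: Int)

  object SqlTiming {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("SqlTiming"))
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD

      // Parse and persist the RDD so the split/map work is done up front.
      val people = sc.textFile("examples/src/main/resources/people.txt")
        .map(_.split(","))
        .map(p => Person(p(0), p(1).trim.toInt))
        .persist()
      people.count()  // output operation: forces parsing + caching now

      people.registerTempTable("people")

      // From here on, only the SQL processing is timed.
      val start = System.nanoTime()
      val teenagers =
        sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
      teenagers.collect()  // output operation: triggers the SQL execution
      val elapsedMs = (System.nanoTime() - start) / 1e6
      println(s"SQL took $elapsedMs ms")
    }
  }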