Also, you could consider caching your data after the first split (before
the first filter); that will keep you from reading the data from S3
twice.
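
A minimal sketch of what that could look like (untested; the S3 path is a
placeholder and the case class is trimmed to the three columns named in the
quoted code below):

case class BookingInfo(num_rooms: String, hotelId: String, toDate: String)

val file = sc.textFile("s3n://your-bucket/bookings.csv") // placeholder path
val data = file.map(x => x.split('|')).cache() // cache BEFORE the first filter

// The first action materializes the cache; per Xiangrui's file.count()
// suggestion below, it also gives a rough benchmark of the raw S3 read time.
data.count()

// Both filters now run against the in-memory RDD instead of re-reading S3.
val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), x(2)))
val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), x(2)))

From here the registerAsTable/cacheTable steps work as in your snippet, but
the split CSV is only pulled from S3 once.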


On Fri, Jun 20, 2014 at 8:32 AM, Xiangrui Meng <men...@gmail.com> wrote:

> Your data source is S3 and the data is read twice. m1.large does not have
> very good network performance. Please try file.count() and see how fast it
> goes.
> -Xiangrui
>
> > On Jun 20, 2014, at 8:16 AM, mathias <math...@socialsignificance.co.uk>
> wrote:
> >
> > Hi there,
> >
> > We're trying out Spark and are running into some performance issues with
> > Spark SQL.
> > Can anyone tell us whether these results are normal?
> >
> > We are using the Amazon EC2 scripts to create a cluster with 3
> > workers/executors (m1.large).
> > We tried both Spark 1.0.0 and the current git master, with both the Scala
> > and Python shells.
> >
> > Running the following code takes about 5 minutes, which seems like a long
> > time for this query.
> >
> > val file = sc.textFile("s3n:// ...  .csv")
> > val data = file.map(x => x.split('|')) // 300k rows
> >
> > case class BookingInfo(num_rooms: String, hotelId: String, toDate: String, ...)
> >
> > val rooms2 = data.filter(x => x(0) == "2").map(x => BookingInfo(x(0), x(1), ..., x(9))) // 50k rows
> > val rooms3 = data.filter(x => x(0) == "3").map(x => BookingInfo(x(0), x(1), ..., x(9))) // 30k rows
> >
> > rooms2.registerAsTable("rooms2")
> > cacheTable("rooms2")
> > rooms3.registerAsTable("rooms3")
> > cacheTable("rooms3")
> >
> > sql("SELECT * FROM rooms2 LEFT JOIN rooms3 ON rooms2.hotelId = rooms3.hotelId AND rooms2.toDate = rooms3.toDate").count()
> >
> >
> > Are we doing something wrong here?
> > Thanks!
> >
> >
> >
>
