Re: Pig on Spark

Bharath Mundlapudi Wed, 23 Apr 2014 12:55:52 -0700

This seems like an interesting question.

I love Apache Pig. It is so natural and the language flows with nice syntax.


While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot for
analytics and gave the Pig team feedback asking for much more functionality
back when it was at version 0.7. A lot of new functionality has been added
since then.
At the end of the day, Pig is a DSL for data flows. There will always be gaps
and enhancements. I have often wondered: is a DSL the right way to solve data
flow problems? Maybe not; we need complete language constructs. We may have
found the answer: Scala. With Scala's dynamic compilation, we can write
much more powerful constructs than any DSL can provide.

If I were a new organization just starting to choose, I would go with Scala.

Here is an example:

#!/bin/sh
exec scala "$0" "$@"
!#
YOUR DSL GOES HERE BUT IN SCALA!

You get DSL-like scripting, functional programming, and the power of a
complete language! If we can improve the first 3 lines, there you go: you
have the most powerful DSL for solving data problems.
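
As a rough sketch of what could go in place of the placeholder line above,
here is a Pig-style flow (load, filter, group, count, dump) written with plain
Scala collections. It uses no Spark or Pig APIs, and the Visit record, its
field names, and the sample rows are all made up for illustration. It is meant
to be run as a script body, after the `!#` line.

```scala
// Hypothetical records standing in for a Pig LOAD; names and values are illustrative only.
case class Visit(user: String, url: String, ms: Long)

val visits = Seq(
  Visit("alice", "/home", 120L),
  Visit("bob", "/home", 300L),
  Visit("alice", "/cart", 80L)
)

// Roughly: FILTER visits BY ms >= 100
val slow = visits.filter(_.ms >= 100L)

// Roughly: GROUP slow BY url, then FOREACH ... GENERATE group, COUNT(slow)
val countsByUrl: Map[String, Int] =
  slow.groupBy(_.url).map { case (url, vs) => url -> vs.size }

// Roughly: DUMP, one tab-separated line per url
countsByUrl.toSeq.sortBy(_._1).foreach { case (url, n) => println(s"$url\t$n") }
```

Each step is an ordinary Scala expression, so loops, functions, and libraries
are available in the middle of the flow without a separate UDF mechanism.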

-Bharath





On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:

> Hi Sameer,
>
> Lin (cc'ed) could also give you some updates about Pig on Spark
> development on her side.
>
> Best,
> Xiangrui
>
> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
> > Hi Mayur,
> > We are planning to upgrade our distribution from MR1 to MR2 (YARN) and the
> > goal is to get Spork set up next month. I will keep you posted. Can you
> > please keep me informed about your progress as well?
> >
> > ________________________________
> > From: mayur.rust...@gmail.com
> > Date: Mon, 10 Mar 2014 11:47:56 -0700
> >
> > Subject: Re: Pig on Spark
> > To: user@spark.apache.org
> >
> >
> > Hi Sameer,
> > Did you make any progress on this? My team is also trying it out and would
> > love to know some details on your progress.
> >
> > Mayur Rustagi
> > Ph: +1 (760) 203 3257
> > http://www.sigmoidanalytics.com
> > @mayur_rustagi
> >
> >
> >
> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
> >
> > Hi Aniket,
> > Many thanks! I will check this out.
> >
> > ________________________________
> > Date: Thu, 6 Mar 2014 13:46:50 -0800
> > Subject: Re: Pig on Spark
> > From: aniket...@gmail.com
> > To: user@spark.apache.org; tgraves...@yahoo.com
> >
> >
> > There is some work to make this work on yarn at
> > https://github.com/aniket486/pig. (So, compile pig with ant
> > -Dhadoopversion=23)
> >
> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark to
> > find out what sort of env variables you need (sorry, I haven't been able to
> > clean this up; it is in progress). There are a few known issues with this;
> > I will work on fixing them soon.
> >
> > Known issues:
> > 1. Limit does not work (spork-fix)
> > 2. Foreach requires turning off schema-tuple-backend (should be a pig-jira)
> > 3. Algebraic UDFs don't work (spork-fix in-progress)
> > 4. Group by rework (to avoid OOMs)
> > 5. UDF Classloader issue (requires SPARK-1053, then you can put
> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf jars)
> >
> > ~Aniket
> >
> >
> >
> >
> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
> >
> > I had asked a similar question on the dev mailing list a while back (Jan
> > 22nd).
> >
> > See the archives:
> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser ->
> > look for spork.
> >
> > Basically Matei said:
> >
> > Yup, that was it, though I believe people at Twitter picked it up again
> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest in
> > this from several other groups, and if there's enough of it, maybe we can
> > start another open source repo to track it. The work in that repo you
> > pointed to was done over one week, and already had most of Pig's operators
> > working. (I helped out with this prototype over Twitter's hack week.) That
> > work also calls the Scala API directly, because it was done before we had a
> > Java API; it should be easier with the Java one.
> >
> >
> > Tom
> >
> >
> >
> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
> > Hi everyone,
> >
> > We are using Pig to build our data pipeline. I came across Spork -- Pig
> > on Spark at: https://github.com/dvryaboy/pig and am not sure if it is still
> > active.
> >
> > Can someone please let me know the status of Spork or any other effort that
> > will let us run Pig on Spark? We can significantly benefit by using Spark,
> > but we would like to keep using the existing Pig scripts.
> >
> >
> >
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
> >
> >
>
