Re: Notes of interest from Apache Pig Hackday, Austin edition

Jeremy Hanna Sat, 12 May 2012 14:26:17 -0700

On May 12, 2012, at 3:42 PM, Jonathan Coveney wrote:

> Wow, that writeup was awesome! And your hackday was really well attended!
> Was it all dachis group people, or did it include people from other Austin
> tech companies?


Thanks for all the help, Jonathan.  About 10 of the 30 attendees were from
the Dachis Group.  Other companies represented were vast.com and truecar.com,
which each had several employees there, as well as Bioware, Dell, HP, Freescale
Semiconductor, Spredfast.com, PayPal, and the University of Texas.

> 
> I think in the future it'd be nice to figure out how to sync with the
> remote hackers... I think given that the Austin hack day was more about
> usage that the format was ok for this one, but as you guys get ramped up,
> it'd be great to collaborate more directly! And my offer to be flown out to
> Austin for a hack day still stands ;)

Yep - would be great to do this kind of thing again.

> 
> Jeremy is right in that on our end it was more about just crunching through
> some tickets, but we did help some users get ramped up with Pig (getting
> pig into eclipse, using git to fork the pig project on github), and we also
> had some good chats about higher level issues with Pig or the ecosystem.
> 
> I personally came away with some projects that could be of varying interest
> for Pig...
> 
> 1. Pull out the logical planner from Pig in such a way that it can target a
> generic physical plan (or something like cascading), and so that other
> projects (hive, scalding) can target it. This is something that people have
> wanted for a long time, but is pretty nontrivial to design. As the
> ecosystem of tools gets more sophisticated, though, the need is really
> really growing...we're getting to the point where there are some pretty
> sophisticated optimizations that could be put into Hive, Pig, etc and the
> duplication of labor is getting very expensive.
> 
> 2. Daniel and I chatted about a possible way to chain operators to save on
> namespace, i.e. instead of doing
> 
> A = load 'thing' as (x:int, y:int);
> B = group A by x;
> C = foreach B generate group, COUNT(A), SUM(A.y);
> D = FILTER C by $1 > 5 and $2 > 7;
> 
> you could do
> 
> A = load 'thing' as (x:int, y:int);
> => group _ by x;
> => foreach _ generate group, COUNT(_), SUM(_.y);
> => FILTER _ by $1 > 5 and $2 > 7;
> 
> and now A would be the result of the entire chain.
> 
> 3. Everyone agrees that EvalFuncs need a major overhaul, and someone should
> move to submit a proposal because right now it's just sort of languishing.
> 
> 4. ONERROR would be a real coup for pig...there's a spec, someone just
> needs to do the work!
> 
> And then there are various and sundry things that I would like to
> do...finish up SchemaTuple, move on to SchemaBag, and so on.
> 
> 2012/5/12 Jagat <[email protected]>
> 
>> Wow Jeremy ,
>> 
>> Thanks for detailed coverage. Seems you guys did lots of good work along
>> with fun.
>> 
>> -----------
>> Sent from Mobile , short and crisp.
>> On 12-May-2012 11:53 PM, "Jeremy Hanna" <[email protected]>
>> wrote:
>> 
>>> Thanks again to Twitter for doing their event and inspiring ours.  I just
>>> wanted to report on some things we did in Austin for anyone interested.  We
>>> had a good turnout of about 30 people.
>>> 
>>> Kevin Safford presented an introduction to Pig, or Pig 101.  The slides
>>> are available here:
>>> http://www.slideshare.net/ktsafford/dachis-group-pigout101-12895911
>>> 
>>> Timothy Potter down from Colorado gave a presentation on intermediate
>>> Pig, or Pig 202.  His slides are available here:
>>> http://www.slideshare.net/thelabdude/dachis-group-pig-hackday-pig-202
>>> 
>>> Clint Miller gave an introduction to unit testing with Pig with these
>>> slides: http://www.slideshare.net/clintmiller1/unit-testing-pig
>>> 
>>> After that we had some lunch and linked up remotely for a bit to the
>>> Twitter hackday in the Bay Area.  Their group is mostly Pig committers and
>>> contributors, so they worked on Pig tickets.  One thing that Twitter
>>> open-sourced as part of the event was a workflow visualization tool called
>>> Ambrose, https://github.com/twitter/ambrose
>>> 
>>> Also mentioned was Alan Gates's excellent reference, Programming Pig; the
>>> web version can be found here:
>>> http://ofps.oreilly.com/titles/9781449302641/index.html
>>>
>>> We started the afternoon with a list of things we could work on:
>>> 
>>>       • Pig mahout integration (pigout) led by Timothy Potter
>>>       • Pig Unit improvements led by Clint Miller
>>>       • David Boney wanted to get his KDD data preparation going with Pig
>>> for a competition
>>>       • Kevin wanted to help people get the presentation examples running
>>>       • Brandon Kearby led a group on helping get the IntelliJ IDEA Pig
>>> plugin working.
>>>       • Josh Levy wanted to see about getting grunt to recognize
>>> parameters passed in.
>>>       • Josh also wanted to look more at the python udf scripting and see
>>> if it could be improved.
>>>       • John Prior wanted to see if there could be a grunt pretty print when
>>> using describe
>>>       • John also wanted to see if bash command history facilities could
>>> be added to grunt
>>>       • John also brought up that knime is a really cool visual workflow
>>> creator for machine learning that could also be developed for Pig.
>>>       • The CassandraStorage loadstorefunc was also brought up as
>>> something Brandon Williams might work on, specifically the way to have it
>>> automatically use secondary indexes.
>>> 
>>> What actually happened?
>>> 
>>> Tim is going to continue working on the pig-vector integration into Mahout
>>> pending some feedback from Tim and the mahout folks.
>>> 
>>> Clint worked on getting the Pig 0.10 branch downloaded and built locally in
>>> order to have something to patch against for the pig unit improvements
>>> outlined on this ticket:  https://issues.apache.org/jira/browse/PIG-2692
>>> 
>>> David Boney got his data loaded up in CFS, the Cassandra file system, and
>>> made some progress there.
>>> 
>>> Several people talked about Pig generally, getting things running on their
>>> own laptops and environments.
>>> 
>>> Brandon Kearby and others forked
>>> https://github.com/brandonkearby/three-little-piggies and the jar in that
>>> project can now be added to your IntelliJ IDEA plugins directory to
>>> associate .pig files and provide source coloring.  There's still some work
>>> to do there, but it's nice to have that working and available for IntelliJ
>>> 11 users.
>>> 
>>> Josh Levy got some ideas together with a couple of other attendees on how
>>> to improve the Pig/Python UDF scripting.  Josh and Jeremy contacted Julien
>>> from Twitter, who had written the Python UDF support, and he is reviewing
>>> Josh's proposed changes with the possibility of creating a ticket for it.
>>> 
>>> Grunt pretty print?  Coincidentally, someone in the Bay Area had the same
>>> thought and, independent of our efforts, created a ticket and submitted a
>>> patch to do just that:
>>> https://issues.apache.org/jira/browse/PIG-2697
>>> 
>>> Brandon Williams is working on the CassandraStorage ticket -
>>> https://issues.apache.org/jira/browse/CASSANDRA-4238
>>> 
>>> Besides that there was great interaction among everyone until people went
>>> their own ways around 4 PM.  Thanks to Twitter for doing their hackathon.
>>> We didn't interact too much with them because their group was more
>>> advanced and we didn't want to slow them down.  Several of us chatted in
>>> the #hadoop-pig channel on freenode (IRC) as well as Russell Jurney and
>>> Jonathan Coveney from the Bay Area.
>>> 
>>> Cheers,
>>> 
>>> Jeremy
>>> 
>>
