Thanks. This is an insightful discussion. Having just glanced at both Plume
and Crunch, they seem similar to Cascading in the sense of being dataflow
languages. Are you able to comment on whether there are important
distinctions?

C

On Oct 31, 2011, at 5:07 PM, Ted Dunning wrote:

> Yeah...
>
> But that doesn't help when I want to write a Pig library for you. It also
> doesn't help when I want to write a pig script that calls your library
> stuff in the middle and then passes the result to something that Jake
> wrote. Pig's optimizer can't build a complete data flow across that
> composite program.
>
> It does help a bit with the problem of, say, iterating over files in a
> directory.
>
> My preference is languages like FlumeJava which start with java and use
> builder-style API to inject the data flow specification.
>
> On Mon, Oct 31, 2011 at 12:54 PM, Dan Brickley <[email protected]> wrote:
>
>> On 31 October 2011 20:22, Ted Dunning <[email protected]> wrote:
>>> On Mon, Oct 31, 2011 at 12:00 PM, Dan Brickley <[email protected]> wrote:
>>>
>>>> On 31 October 2011 17:27, Ted Dunning <[email protected]> wrote:
>>>>> I think this would be very interesting to see. Whether it should be part
>>>>> of Mahout or a separate project is an open question.
>>>>>
>>>>> PIG, is, unfortunately not a real language in the sense of turing
>>>>> completion or extensibility. It is good at what it does, but not at being
>>>>> extended to do more.
>>>>
>>>> ...although you can call out to functions defined in Java, Python etc.
>>>> This doesn't make the top level language into a programming language,
>>>> though. Was that your point, Ted?
>>>
>>> Yes. That was the point. Calling out is different from being able to
>>> control the process from the outside in.
>>
>> I've just found http://wiki.apache.org/pig/TuringCompletePig which has
>> copious notes on ways to address this. Excerpting a little:
>>
>> """Pig Latin is a data flow language. As such it does not offer users
>> control flow and modularity features that are present in general
>> purpose programming languages, including functions, modules, loops,
>> and branches. Given that it is a data flow language adding these
>> constructs is neither straightforward nor reasonable. However, users
>> do want to be able to integrate standard programming techniques of
>> separation and code sharing offered by functions and modules as well
>> as integration of control flow offered by functions, loops, and
>> branches. This document proposes a way to accomplish these goals while
>> preserving Pig Latin's data flow orientation."""
>>
>> Spoiler alert (wiki page has a lot more detail). Plan seems to be
>> combination of macros (which are now in the language) and "second part
>> of the proposal is to embed Pig Latin scripts in the host scripting
>> language via a JDBC like compile, bind, run model. "
>>
>> I'm not sure how far along that part is...
>>
>> Dan
>>
>> ps. the following 3 links have everything I attempted before with
>> Pig/Mahout integration; not a lot, but it left me intrigued and
>> frustrated in equal measure.
>>
>> http://www.mail-archive.com/[email protected]/msg02848.html
>> https://gist.github.com/1192831
>>
>> http://search-lucene.com/m/IOfRIc6wGq1&subj=+Unknown+program+chosen+Valid+program+names+are+truncated+list+from+Hadoop+program+driver
>>
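
A rough sketch of the builder-style idea Ted describes in the quoted text: start in a general-purpose language and build the data flow specification through API calls, so that library code and caller code end up in one plan. It is written in Python for brevity rather than Java, and every name in it (Flow, parallel_do, plan, run, tokenize_library, jakes_library) is hypothetical. It is not the FlumeJava, Crunch, Plume, or Cascading API, only an illustration of deferred execution under those assumptions.

```python
# Hypothetical names throughout; this is not the API of any real library.

class Flow:
    """A deferred collection: records operations instead of running them."""

    def __init__(self, source, ops=()):
        self.source = source      # any iterable of input records
        self.ops = tuple(ops)     # recorded (kind, function) steps

    def parallel_do(self, fn):
        # Record a per-record transform; nothing is executed yet.
        return Flow(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        # Record a per-record filter; nothing is executed yet.
        return Flow(self.source, self.ops + (("filter", pred),))

    def plan(self):
        # The whole pipeline is visible here, including steps added by
        # library functions, so a planner could fuse or reorder them.
        return [kind for kind, _ in self.ops]

    def run(self):
        # Naive fused execution: one pass over the data for all steps.
        out = []
        for record in self.source:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):
                    keep = False
                    break
            if keep:
                out.append(record)
        return out


def tokenize_library(flow):
    # Stands in for "your library": takes a Flow and returns a Flow.
    return flow.parallel_do(str.lower).filter(lambda w: len(w) > 2)


def jakes_library(flow):
    # Stands in for "something that Jake wrote".
    return flow.parallel_do(len)


words = Flow(["Pig", "is", "a", "Dataflow", "language"])
pipeline = jakes_library(tokenize_library(words))
print(pipeline.plan())
print(pipeline.run())
```

Because both library functions take and return the same deferred object, the steps they add land in a single recorded plan that is only executed at run(). That is the property the quoted text says a Pig script calling out to separate libraries cannot give the optimizer.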
