Thanks. This is an insightful discussion. I have just glanced at both Plume
and Crunch, and they seem similar to Cascading in the sense of being dataflow
languages. Can you comment on whether there are important distinctions
between them?
C
On Oct 31, 2011, at 5:07 PM, Ted Dunning wrote:

> Yeah...
> 
> But that doesn't help when I want to write a Pig library for you.  It also
> doesn't help when I want to write a Pig script that calls your library
> stuff in the middle and then passes the result to something that Jake
> wrote.  Pig's optimizer can't build a complete data flow across that
> composite program.
> 
> It does help a bit with the problem of, say, iterating over files in a
> directory.
> 
> My preference is for languages like FlumeJava, which start with Java and use
> a builder-style API to inject the data flow specification.
> 
> On Mon, Oct 31, 2011 at 12:54 PM, Dan Brickley <[email protected]> wrote:
> 
>> On 31 October 2011 20:22, Ted Dunning <[email protected]> wrote:
>>> On Mon, Oct 31, 2011 at 12:00 PM, Dan Brickley <[email protected]> wrote:
>>> 
>>>> On 31 October 2011 17:27, Ted Dunning <[email protected]> wrote:
>>>>> I think this would be very interesting to see.  Whether it should be
>>>>> part of Mahout or a separate project is an open question.
>>>>> 
>>>>> Pig is, unfortunately, not a real language in the sense of Turing
>>>>> completeness or extensibility.  It is good at what it does, but not at
>>>>> being extended to do more.
>>>> 
>>>> ...although you can call out to functions defined in Java, Python etc.
>>>> This doesn't make the top level language into a programming language,
>>>> though. Was that your point, Ted?
>>> 
>>> Yes.  That was the point.  Calling out is different from being able to
>>> control the process from the outside in.
>> 
>> I've just found http://wiki.apache.org/pig/TuringCompletePig which has
>> copious notes on ways to address this. Excerpting a little:
>> 
>> """Pig Latin is a data flow language. As such it does not offer users
>> control flow and modularity features that are present in general
>> purpose programming languages, including functions, modules, loops,
>> and branches. Given that it is a data flow language adding these
>> constructs is neither straightforward nor reasonable. However, users
>> do want to be able to integrate standard programming techniques of
>> separation and code sharing offered by functions and modules as well
>> as integration of control flow offered by functions, loops, and
>> branches. This document proposes a way to accomplish these goals while
>> preserving Pig Latin's data flow orientation."""
>> 
>> Spoiler alert (the wiki page has a lot more detail).  The plan seems to be
>> a combination of macros (which are now in the language) and embedding: the
>> "second part of the proposal is to embed Pig Latin scripts in the host
>> scripting language via a JDBC like compile, bind, run model."
>> 
>> I'm not sure how far along that part is...
>> 
>> Dan
>> 
>> ps. the following 3 links have everything I attempted before with
>> Pig/Mahout integration; not a lot, but it left me intrigued and
>> frustrated in equal measure.
>> 
>> http://www.mail-archive.com/[email protected]/msg02848.html
>> https://gist.github.com/1192831
>> 
>> http://search-lucene.com/m/IOfRIc6wGq1&subj=+Unknown+program+chosen+Valid+program+names+are+truncated+list+from+Hadoop+program+driver
>> 
