Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change 
notification.

The following page has been changed by AlanGates:
http://wiki.apache.org/pig/ProposedRoadMap

New page:
[[Anchor(Pig_Road_Map)]]
= Pig Road Map =

The following document was developed as a roadmap for pig at Yahoo prior to pig 
being released as open source. 

[[Anchor(Abstract)]]
== Abstract ==
This document lays out the road map for pig as of the end of Q3 2007.  It 
begins by laying out the [#Pig_Vision vision] for pig and describing its
[#Current_Status current state].  It [#Development_Categories categorizes] 
different features to be worked on and discusses
the priorities of those categories.  Each feature is described in 
[#Feature_Details detail].  Finally, it discusses how individual
features will be [#Priorities prioritized].

[[Anchor(Pig_Vision)]]
== Pig Vision ==
The pig query language provides several significant advantages compared to 
straight map reduce:
   1. There are common data query operations that most, if not all, users make use of.  These include project, filter, aggregate, join, and sort (all of the common database relational operators).  Pig provides these in the map reduce framework so that users need not implement these common operators over and over, while still allowing users to include their own custom code for less common operations.  (A sketch combining several of these operators follows this list.)
   2. The use of a declarative language opens up grid data queries to users who 
do not have the expertise, time, or inclination to write java programs to 
answer their queries.
   3. A great majority of data queries can be answered using the provided 
operators.  A declarative language greatly reduces the development effort 
needed by users to make use of grid data.
   4. An interactive shell allows users to easily experiment with their data 
and queries in order to shorten their development cycle.
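
As a simple illustration (the path and field names are made up; the syntax follows the examples later in this document), a query combining several of these common operators might look like:
{{{
a = load '/user/me/mydata' as (query, url, user_clicked);  -- project named fields
b = filter a by user_clicked >= '1';                       -- filter
c = group b by query;                                      -- group for aggregation
d = foreach c generate group, COUNT($1);                   -- aggregate
e = order d by $1;                                         -- sort
store e into '/user/me/myresult';
}}}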

All of this comes at a price in performance and flexibility.  An expert 
programmer will always be able to write more efficient code for a given job in 
a lower level language than can be written in a general purpose higher level 
language.  Also, some types of jobs may not fit nicely into pig's query
model.  For these reasons it is not envisioned that pig should become the only 
way to access data on the grid.  However, given sufficient performance,
stability, and ease of use, the majority of users will find pig an easier way 
to write, test, and maintain their code.

Given this vision, the goals for the pig team over the next twelve months are:
   1. Make pig a stable, usable, production quality product.
   2. Make pig user friendly in terms of understanding the data being queried, 
how queries are executed, etc.
   3. Provide performance at or near the same level available from implementing 
similar operations directly in map reduce.
   4. Provide flexibility for users to integrate their code, in any of several 
supported common languages, into their pig queries.
   6. Provide users a quality shell in which to navigate the HDFS and do query 
development.
   7. Provide tools for grid administrators to understand how their data is 
being queried and created via pig.

[[Anchor(Current_Status)]]
== Current Status ==
The current status of pig, measured against the goals above, is:
   1. Release 1.2 of pig focused mainly on goal 1 of making pig stable and production quality.  There is still work to be done in this, particularly in the areas of error handling and documentation.
   2. Very little support is given to users at this time to assist them in 
understanding the structure of the data they are querying, how their queries 
will be executed, etc.
   3. Pig performance against similar code written directly in map reduce has 
not been tested.
   4. Users can integrate java code with pig.  Other languages are not 
supported.  The supported java code must be a function that accepts a row at a 
time and outputs one or more rows in response.  Other models, such as 
streaming, are not supported.
   6. The grunt shell provides the ability to browse the HDFS and to create pig 
queries.  It is very rudimentary, lacking many features of a modern shell.
   7. There are no tools currently to allow a grid administrator to monitor how 
pig is being used on the grid.

[[Anchor(Development_Categories)]]
== Development Categories ==
Features under consideration for development are categorized as:
|| '''Category''' || '''Priority''' || '''Types of Features''' || '''Comments''' ||
|| Engineering manageability || High || release engineering, testing || ||
|| Infrastructure || High || architectural changes, such as metadata, types || ||
|| Performance || High / Medium || || Changes that bring pig into near parity with direct map reduce are high priority, others are medium ||
|| Production quality || Medium || good error handling, reliability || ||
|| User experience || Medium || ease of use, providing adequate information for the user || ||
|| Security || Low || || Do not have requirements or security support from HDFS yet ||

[[Anchor(Feature_Details)]]
== Feature Details ==
For each of the features, the following information is given:

'''Explanation''':  What this feature is.

'''Rationale''':  Why it is needed.  This should include a reference to which of the pig goals mentioned above this meets.

'''Category''':  Which of the above [#Development_Categories categories] the 
feature fits into.

'''Requestor''':  Indicates whether the feature was requested by users, development, QA, administrators (that is, grid administrators), or management.

'''Depends On''':  Other features that must be implemented before this one can be.  Most of these are internal, but a few are external dependencies.

'''Current Situation''':  The feature may be totally non-existent, already 
partially addressed in current pig, under design, or under development.

'''Estimated Development Effort''':
 * small:  less than 1 person month
 * medium:  1-3 person months
 * large:  more than 3 person months

'''Urgency''':
 * low:  only requested by a few users
 * medium:  requested by many users or adds significant value to the language
 * high:  requested ubiquitously, adds tremendous value to the language, or enables a number of other changes

[[Anchor(Describe_Schema)]]
=== Describe Schema ===
'''Explanation''':  Users need to be able to see the schema of data they wish 
to query.  This includes finding the schema of data that is not yet being
queried (e.g. `describe '/user/me/mydata'`) and determining the schema of a 
relation after several pig operations, e.g.:
{{{
a = load('/user/me/mydata') as (query, urls, user_clicked);
b = filter a by user_clicked >= '1';
c = group b by query;
describe c;
}}}

Also, this should allow the user to see the schema that will result from 
merging several files, e.g.
{{{
a = load('/user/me/mydata/*');
describe a;
}}}
should return the merged schema of all the files in the directory 
`/user/me/mydata`.

'''Rationale''':  Goal 2.

'''Category''':  User experience.

'''Requestor''':  Users, Development. 

'''Depends On''': [#Metadata Metadata]

'''Current Situation''':  As of pig 1.2 the second type of describe (schema of 
a current relation) can be done.  Describing the schema of a file cannot be
done until metadata about the file exists.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Medium.

[[Anchor(Documentation)]]
=== Documentation ===
'''Explanation''':  Pig users need comprehensive user documentation that will 
help them write queries, write user defined functions, submit their pig scripts
to the cluster, etc.  

'''Rationale''':  Goal 1.

'''Requestor''':  Users, Development. 

'''Category''':  User experience.

'''Depends On''':  

'''Current Situation''':  There are a few existing wiki pages at http://wiki.apache.org/pig/.  However, there is not enough at this time to allow users to get started with pig, and what does exist is not collected into a coherent whole with an index, etc.

'''Estimated Development Effort''':  Medium.  This effort requires expertise not currently present on the Pig development team.  A documentation writer is needed to assist the team in this.

'''Urgency''':  High.

[[Anchor(Early_Error_Detection_and_Failure)]]
=== Early Error Detection and Failure ===
'''Explanation''':  Errors should be detected in pig queries at the earliest 
possible opportunity.  This includes syntax parsing that should be done
before pig connects to HOD at all, and type checking that should be done before 
the jobs are submitted to Map Reduce.
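
For example, in the hypothetical query below, both problems could be reported before any cluster resources are used: the first at parse time, the second during type checking before any map reduce jobs are submitted:
{{{
a = load '/user/me/mydata' as (query, url, user_clicked);
b = filte a by user_clicked >= '1';            -- misspelled keyword: catch at parse time
c = foreach a generate user_clicked / query;   -- dividing by a string field: catch at type-check time
}}}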

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Development

'''Depends On''':  Some types of checking will require [#Metadata Metadata], 
others could be accomplished without it.

'''Current Situation''':  Currently a number of errors are not detected until 
runtime.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Medium.

[[Anchor(Error_Handling)]]
=== Error Handling ===
'''Explanation''':    Pig needs to divide errors into categories of:
   * Warning:  Something went wrong but it is not a significant enough issue to stop execution, though a particular data value may be lost (and replaced by a NULL).  For example, this would be used for a divide by zero error (see the sketch after this list).
   * Error:  Something went wrong and the current task will be aborted, but the query as a whole should continue and the task should be retried.  For example, a resource was temporarily unavailable.
   * Fatal:  Something went wrong and there is no reason to attempt to finish 
executing the query.  For example, the requested data file cannot be found.
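
A sketch of the warning case (hypothetical query, field names illustrative):
{{{
a = load '/user/me/mydata' as (url, clicks, visits);
b = foreach a generate url, clicks / visits;
-- rows where visits is 0 would produce a warning and a NULL in the output,
-- rather than aborting the whole query
}}}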

'''Rationale''':  Goal 1.

'''Category''':  Production quality.

'''Requestor''':  Development.

'''Depends On''':  [#NULL_Support NULL Support]

'''Current Situation''':  Most errors encountered during map reduce execution 
cause the system to immediately abandon the query (e.g. a divide by zero
error in a query).

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Explain_Plan_Support)]]
=== Explain Plan Support ===
'''Explanation''':  Users want to be able to see how their query will be executed.  This helps them understand how changes in their queries will affect the execution.
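
A hypothetical example (the `explain` keyword below is illustrative only; no such command exists yet):
{{{
a = load '/user/me/mydata';
b = group a by $0;
c = foreach b generate group, COUNT($1);
explain c;   -- would show which operations run in each map and reduce stage
}}}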

'''Rationale''':  Goal 2.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  No explanation is currently available for query 
execution.  The user can read the output of pig to see how many times map 
reduce was invoked,
but even this does not make clear which operations are being done in each step.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Low.

[[Anchor(Removing_Automatic_String_Encoding)]]
=== Removing Automatic String Encoding ===
'''Explanation''':  Currently all data that pig reads from the grid is assumed to be strings encoded in UTF8.  However, some of the data stored in the grid is not 100% UTF8.  Furthermore, the encoding is not provided within the data, and the strings in question are too short to use inductive algorithms to determine the encoding.  Pig (and any other user) therefore has no way to determine the encoding.  However, since pig currently assumes the data is UTF8 encoded, it corrupts the data by converting it improperly.

There are two partial solutions to this.  One is to provide users with a C-string-like data type that provides some basic operators (equality, less than, maybe very limited regular expressions) and does not attempt any interpretation or translation of the underlying data.  This will be done as part of the types work to create a byte data type.  The second solution is, even for full string types, to only translate the underlying data to java strings when it is actually required, instead of by default.  This should be done for performance reasons, as translation is expensive.

'''Rationale''':  Goal 1.

'''Category''':  Production quality.

'''Requestor''':  Development

'''Depends On''':  [#Types_Beyond_String Types Beyond String]

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  Small, as most of the work will be done in 
"Types Beyond String".

'''Urgency''':  Medium.

[[Anchor(Language_Expansion)]]
=== Language Expansion ===
'''Explanation''':  Users have requested a number of extensions to the language.  Some of these are new functionality, some just extensions of existing functionality.  Requested extensions are (a syntax sketch follows the list):
   1. CASE statement.  This would generalize the currently available two-way conditional operator (`? :`).
   2. Extend operators allowed inside `FOREACH { ... GENERATE }`.  Users have 
requested that GROUP, COGROUP, and JOIN be added.
   3. Allow user to specify that a sort should be done in descending order 
instead of the default ascending order.  This should be extended to the SQL 
functionality of specifying ascending or descending for each field, e.g. `ORDER 
a BY $0 ASCENDING, $1 DESCENDING`.
   5. Add LIMIT functionality.  
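
A sketch of what some of the requested syntax might look like (none of this exists yet, and the final syntax is to be determined):
{{{
a = load '/user/me/mydata';
b = order a by $0 ASCENDING, $1 DESCENDING;  -- requested per-field sort direction
c = limit b 100;                             -- requested LIMIT functionality
}}}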

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  None of the above requested extensions currently 
exist.

'''Estimated Development Effort''':  Medium for all of the above, but note that any one of the above could be added without the others.  Taken alone, each of the above is Small.

'''Urgency''':  Low.

[[Anchor(Logging)]]
=== Logging ===
'''Explanation''':  Pig needs a consistent, configurable way to write query 
progress, debugging, and error information both to the user and to logs.

'''Rationale''':  Goal 1.

'''Category''':  User experience and Engineering manageability.

'''Requestor''':  Development.

'''Depends On''':  

'''Current Situation''':  Most pig progress, debugging, and error information 
is currently written to stderr via System.err.println.  Hadoop uses log4j
to provide trace logging.  Recently (as of pig 1.1d), pig began to use log4j as 
well.  However, very few parts of the code have been converted from
println to log4j.  In addition, pig only has a screen appender for log4j, so it 
only writes to stderr.  Pig needs to generate a log file on the front end as well.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Metadata)]]
=== Metadata ===
'''Explanation''':  Pig needs metadata to do the following:
   1. Provide a way for users to understand the data they want to query.  For example, users should be able to say `DESCRIBE '/user/me/mydata'` and get back a list of (minimally) fields and their types (a sketch follows this list).
   2. Do type checking to find situations where users issue queries that are 
not semantically meaningful (such as dividing by a string).
   3. Allow performance optimizations such as:
      * performing arithmetic operations (e.g. SUM) as a long if the underlying 
type is an int or long, instead of as a double.
      * allowing pig to load and store numeric types as numerics rather than 
requiring a conversion from string.
   4. Provide a way to describe to a user defined function the format of the 
data being passed to the function.
   5. Allow sorting in numeric order in addition to lexical order.
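
For example, with metadata available, describing an unqueried file might return something like the following (output format is illustrative only):
{{{
describe '/user/me/mydata';
-- possible output: (query: string, url: string, user_clicked: int)
}}}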

The decision has been made to use the Jute interface, provided by hadoop, to describe metadata.  This will allow pig to interact with other hadoop tools.  It will also free the pig team from needing to develop their own metadata management library.

The goal is not to require metadata in pig.  Pig will retain the flexibility to 
work on unspecified input data.

It remains an open question how pig should handle the case where the data it is provided does not match the metadata specification.  The default should be to produce a warning for each non-conforming row.  It remains to be determined if there is a use for a "strict" mode where a non-conforming row would cause a query-stopping failure.

Note that this metadata only refers to metadata local to a file.  It does not
refer to global metadata such as table names, available UDFs, etc.

'''Rationale''':  Goals 2, 3, 4.

'''Category''':  Infrastructure.

'''Requestor''':  Everyone.

'''Depends On''':  [#Types_Beyond_String Types Beyond String], some changes to 
Jute.

'''Current Situation''':  No metadata is available concerning data stored in 
the grid.

'''Estimated Development Effort''':  Large.

'''Urgency''':  High.

[[Anchor(NULL_Support)]]
=== NULL Support ===
'''Explanation''':  Pig needs to be able to support NULLs in its data for the 
following reasons:
   1. Some of its input data has NULL values in it, and users want to be able 
to act on this NULL data (e.g. filter it out via IS NOT NULL).
   2. Function and expression evaluations sometimes are unable to return a 
value, but execution of the query should not stop (e.g. divide by zero error).  
In this case the evaluation needs to return a NULL value to place in the field.

This requires that Jute support NULL values in data stored in its format.

This will require additions to the language, namely the ability to filter on IS 
(NOT) NULL.  It will also require that user defined functions and expression
evaluators determine how they will
interact with NULLs.  Functions that are similar to SQL functions (such as 
COUNT, SUM, etc.) and arithmetic expression operators should behave in a way 
consistent
with SQL standards (even though SQL standards themselves are inconsistent) in 
order to avoid violating the law of least astonishment.
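
A sketch of the requested filtering syntax (hypothetical; final syntax TBD):
{{{
a = load '/user/me/mydata' as (query, url, user_clicked);
b = filter a by user_clicked is not null;  -- keep only rows that have a value
c = filter a by user_clicked is null;      -- or select only the rows missing one
}}}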

'''Rationale''':  Goal 1.

'''Category''':  Infrastructure.

'''Requestor''':  Development.

'''Depends On''':  Jute NULL support.

'''Current Situation''':  There is no concept of NULL in pig at this time.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High.

[[Anchor(Parameterized_Queries)]]
=== Parameterized Queries ===
'''Explanation''':  Users would like the ability to define parameters in a pig query.  When the query is invoked, values for those parameters would be supplied and used in executing the query.  For example:
{{{
a = load '/data/mydata/@date@';
b = load '@latest_bot_filter@';
...
}}}

The query above could then be invoked, providing values for 'date' and 
'latest_bot_filter'.

'''Rationale''':  Goal 1.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  

'''Current Situation''':  No support for this is available.

'''Estimated Development Effort''':  Small.

'''Urgency''':  Low.

[[Anchor(Performance)]]
=== Performance ===
'''Explanation''':  Pig needs to run, as close as is feasible for a generic language, at near the same performance that could be obtained by a programmer developing an application directly in map reduce.  There are multiple possible routes to take in performance enhancement:
   1. Query optimization via relational operator reordering and substitution.  For example, pushing filters as far up the execution plan as possible, pushing expression evaluation as late as possible, etc.  (See the sketch after this list.)
   2. In some instances it is possible to execute multiple relational operators 
in the same map or reduce.  In some cases pig is already doing this (e.g. 
multiple filters are pipelined into one map operation).  There are additional 
cases that pig is not taking advantage of that it could (for example if a join 
is followed immediately by an aggregation on the same key, both could be placed 
in the same reduce).
   4. After the map stage, the map reduce system sorts the data according to keys specified by the job requester.  This key comparison is done via a class specified by the job requester.  Currently, this class is instantiated on every key comparison.  This means that the constructor for this class is called at least nlogn times.  Hadoop provides a byte comparator option where the sequence of bytes in the key are compared directly rather than through an instantiated class.  Wherever possible pig needs to make use of this comparator.
   5. Hadoop provides the ability to run a mini-reduce stage (called the 
combine stage) after the map and before data is shuffled to new processes on 
the reduce.  For certain algebraic functions (such as SUM and COUNT) the amount 
of data to be shipped between map and reduce could be reduced (sometimes 
greatly) by running the reduce step first in the combine, and then again in 
reduce.
   6. The code as it exists today can be instrumented and tests done to 
determine where the most time is spent.  This information can then be used to 
optimize heavily used areas of the code that are taking significant amounts of 
time.
   7. Split performance is very poor, and needs to be optimized.
   8. Currently, given a line of pig like `foreach A generate group, COUNT($0), 
SUM($1)`, pig will run over `A` twice, once for `COUNT` and once for `SUM`.  It 
should be able to compute both in a single pass.
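
As an illustration of the first two items (hypothetical query; join syntax illustrative), the filter below could be pushed before the join so that less data is shipped to the reduce, and the grouping on the join key could share a reduce with the join:
{{{
a = load '/user/me/pages' as (url, size);
b = load '/user/me/clicks' as (url, user);
c = join a by url, b by url;
d = filter c by $1 > '100';   -- could be evaluated on a before the join
e = group d by $0;            -- same key as the join, could run in the same reduce
f = foreach e generate group, COUNT($1);
}}}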

'''Rationale''':  Goal 3.

'''Category''':  Performance.

'''Requestor''':  Everyone.

'''Depends On''':  [#Metadata Metadata] for some optimizations

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  
   1. Large
   2. Medium
   4. Small
   5. Small
   6. Small
   7. ?
   8. ?

'''Urgency''':  High for 4, 5, and 6, Medium for the rest.

[[Anchor(Query_Optimization)]]
=== Query Optimization ===
See [#Performance Performance]

[[Anchor(Shell_Support)]]
=== Shell Support ===
'''Explanation''':  There are three main goals of a pig shell:
   1. The ability to interact in a shell with the HDFS.
   2. The ability to interact in a shell with the user's machine.  This would 
allow the user to edit local files, etc.
   3. The ability to write pig scripts in an interactive shell (similar to perl 
or python shells).

Items 1 and 2 are similar, the only difference being which file system and OS 
they interact with.  Item 3 is something entirely different, though both types
of items are called shells.  Item 3 is an interactive programming environment.  
For this reason I do not think it makes sense to attempt to combine the
two shells.  Instead I propose that two shells be developed:

'''hsh''' (hadoop shell) - This shell will provide shell level interaction with 
HDFS.  In addition to the current operators of cat, cp, kill, ls, mkdir, mv,
pwd, rm it may be necessary to add chmod (based on hadoop implementation of 
permissions), chown and chgrp (based on hadoop implementation of ownership),
df, ds, ln (if hadoop someday supports links), and touch.  This shell will also 
support access to the user's local machine and all of the available shell
commands.  Ideally the shell would even support applying standard shell file 
tools that access files linearly (grep, more, etc.) to HDFS files, but this
may be a later addition.  Separate commands would not be required for cp, rm 
etc. based on which file system.  The shell should be able to determine that
based on the path.  To accomplish this, HDFS files would have /hdfs/CLUSTER 
(where CLUSTER is the name of the cluster, e.g. kryptonite) as the base of
their path.  hsh would then need to intercept shell commands such as rm, and 
determine whether to execute in HDFS or the local file system, based on the
file being manipulated.  For example, if a user wanted to create a file in his 
local directory on the gateway machine and then copy it to
the HDFS, he could then do:
{{{
hsh> vi myfile
hsh> cp myfile /hdfs/kryptonite/user/me/myfile
}}}
The invocation of vi would call vi on the local machine and edit a file in the 
current working directory on his local box.  The copy would move the file
from the user's local directory on the gateway machine to his directory on the 
HDFS.

hsh will support standard command line editing (and ideally tab completion).  This should be done via a third party library such as jline.

'''grunt''' - This shell will provide interactive pig scripting, similar to the 
way it does today.  It will no longer support HDFS file manipulation commands
(such as mv, rm, etc.).  It will also be extended to support inline scripting 
in either perl or python.  (Which one should be used is not clear.  The
research team votes for python.  User input is needed to decide if more users 
would prefer python or perl.) This will enable two important features:
   1. Embedding pig in a scripting language to give it control flow and 
branching structures.
   2. Allow on the fly development of user defined functions by creating a 
function native to the scripting language and then referencing it in the pig 
script.

The grunt shell will be accessible from hsh by typing `grunt`.  This will put 
the user in the interactive shell, in much the same way that typing `python`
on a unix box puts the user in an interactive python development shell.

'''Rationale''':  Goals 4 and 6.

'''Category''':  Infrastructure.

'''Requestor''':  Development, Users

'''Depends On''':  

'''Current Situation''':  Currently grunt provides very limited commands (ls, mkdir, mv, cp, cat, rm, kill) for the HDFS.  It also provides an interactive shell for generating pig scripts on the fly (similar to perl or python interactive shells).  No command line history or editing is provided.  No connection with the user's local file system is supported.

'''Estimated Development Effort''':  Large.

'''Urgency''':  Medium.

[[Anchor(SQL_Support)]]
=== SQL Support ===
'''Explanation''':  There is a large base of SQL users.  Many of these users will prefer to query data in SQL rather than learn pig.  To accommodate these users and increase the adoption of hadoop, we could provide a SQL layer on top of pig.  From pig's perspective, the easiest way to address this is to translate SQL directly to pig, and then execute the resulting pig script.
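
As an illustration (both queries are made up, and the table to file mapping would itself require metadata), a SQL query and a direct pig translation might look like:
{{{
-- SQL:  SELECT query, COUNT(*) FROM clicks WHERE user_clicked >= '1' GROUP BY query;
a = load '/user/me/clicks' as (query, url, user_clicked);
b = filter a by user_clicked >= '1';
c = group b by query;
d = foreach c generate group, COUNT($1);
}}}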

'''Rationale''':  Goal 4.

'''Category''':  User experience.

'''Requestor''':  Users

'''Depends On''':  [#Metadata Metadata]

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  Large.

'''Urgency''':  Medium.


[[Anchor(Statistics_on_Pig_Usage)]]
=== Statistics on Pig Usage ===
'''Explanation''':  Pig should record statistics on its usage so that pig developers and grid administrators can monitor and understand pig usage.  These statistics need to be collected in a way that does not compromise the security of the data being queried (i.e. they cannot store results of the queries).  They should, however, contain:
   * files loaded
   * files created
   * queries submitted (including the text of the queries)
   * time the query took to execute 
   * number of rows input to the query
   * number of rows output by the query
   * size of intermediate results created by the query
   * error and warning count
   * error and warning types and messages
   * status result of the query (did it succeed, fail, get interrupted, etc.)
   * user who executed the query.
   * operations performed by the query (e.g. did the query include a filter, a 
group, etc.)  This can be used by developers to search for all queries that 
include an aggregation, etc.

For security reasons, these statistics will need to be kept inside the grid they are generated on, and only be accessible to cluster administrators and pig developers.

'''Rationale''':  Goal 7.  This also assists developers in meeting goals 1 and 3
because it allows them to determine the quality and performance of the user
experience.

'''Category''':  Engineering manageability.

'''Requestor''':  Administrators

'''Depends On''':  

'''Current Situation''':  There are no usage statistics available.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium.

[[Anchor(Stream_Support)]]
=== Stream Support ===
'''Explanation''':  Hadoop supports running HDFS files through an arbitrary 
executable, such as grep or a user provided program.  This is referred to as
streaming.  There exists already a base of user programs used in this way.  
Users have expressed an interest in integrating these with pig so that they
can take advantage of pig's parallel structure and relational operators but 
still use their hand crafted programs to express their business logic.  For
example (note:  syntax and semantics still TBD)
{{{
a = load '/user/me/mydata';
b = filter a by $0 matches "^[a-zA-Z]\.yahoo.com";
c = stream b through countProperties;
...
}}}

'''Rationale''':  Goal 4.

'''Category''':  Infrastructure.

'''Requestor''': Users

'''Depends On''':  

'''Current Situation''':  No support for streaming is currently available.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  Medium (while it has only been requested by a couple of users 
to date, we believe it will open up pig usage to a number of users and
therefore is more desirable).

[[Anchor(Test_Framework)]]
=== Test Framework ===
'''Explanation''':  Data querying systems like pig have unique functional 
testing requirements.  For unit tests, junit is an excellent tool.  But for function tests, where arbitrary queries of potentially large amounts of data need to be added on a regular basis, it is not feasible to use a testing system like junit that assumes the test implementer knows the correct result for the test when the test is implemented.  We need a tool that will allow the
tester to designate a source of truth for the test, and then generate the 
expected results for that test from that source of truth.  For example,
for small functional tests a database could be set up with the same data as in 
the grid.  The tester would then write a pig query and an equivalent SQL query, 
and
the test harness would run both and compare the results.

In addition to the above, a set of performance tests need to be created to 
allow developers to test pig performance.  It should be possible to develop
these in the same framework as is used for the functional tests.

'''Rationale''':  Goal 1.

'''Category''':  Engineering manageability.

'''Requestor''':  Development.

'''Depends On''':  

'''Current Situation''':  Pig has unit tests in junit.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High (quality testing that is easy for developers to implement facilitates better and faster development of features).

[[Anchor(Test_Integration_with_Hudson)]]
=== Test Integration with Hudson ===
'''Explanation''':  Currently hadoop uses hudson as its nightly build 
environment.  This includes checking out, building, and
running unit tests.  It is also used to test new patches that are submitted to 
jira.  Pig needs to be integrated with hudson in this same way.

Also, there is a set of tools (Findbugs and code coverage metrics) that is regularly applied to other hadoop nightly builds via hudson.  These need to be applied to pig as well.

'''Rationale''':  Goal 1.

'''Category''':  Engineering manageability.

'''Requestor''':  QA.

'''Depends On''':  

'''Current Situation''':  See Explanation.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High (frees committers from manually testing patches).

[[Anchor(Types_Beyond_String)]]
=== Types Beyond String ===
'''Explanation''':  For performance and usability reasons pig needs to support 
atomic data types beyond strings.  Based on data types supported in jute and
standard data types supported by most data processing systems, these types will 
be:
   * int (32 bit integer)
   * long (64 bit integer)
   * float (32 bit floating point number)
   * double (64 bit floating point number)
   * string (already supported)
   * byte (binary data) - Should this be called cstring, or bstring, or something else?  We want to provide some basic operators for it, so it isn't simply a blob or binary type.  But it isn't a full-fledged java string either.
   * user defined types

Supporting these types in a native format will allow performance gains in 
reading and writing data to disk and in key comparison for sorting, grouping,
etc.  It will also avoid requiring string->number conversions for every row 
during numeric expression evaluation.  It will also allow sorting in numeric
order rather than requiring that all sorts be in lexical order, as is currently 
the case.
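
For example (hypothetical data), a sort on a count field today orders lexically, so '10' sorts before '9'; with the field typed as an int the same query would order numerically:
{{{
a = load '/user/me/counts' as (query, hits);
b = order a by $1;   -- lexical today; numeric once hits can be declared as an int
}}}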

User defined types will require the user to provide a set of functions to:
   * load and store the object
   * convert the object to a java string.

Optionally, a user could provide functions that do:
   * less than comparison - this would be required to do sorting and filter evaluations
   * for optimization, the full set of less than or equal to, equal, not equal, greater than, and greater than or equal to
   * matches (regular expression)
   * arithmetic operations on the types (`+ - * /`)

'''Rationale''':  Goals 1 and 3.

'''Category''':  Infrastructure.

'''Requestor''':  Development.

'''Depends On''': 

'''Current Situation''':  A functional spec for changes to types has been proposed; see http://wiki.apache.org/pig/PigTypesFunctionalSpec

'''Estimated Development Effort''':  Large.

'''Urgency''':  High.

[[Anchor(User_Defined_Function_Support_in_Other_Language)]]
=== User Defined Function Support in Other Languages ===
'''Explanation''':  Many potential pig users are not java programmers.  Many 
prefer to program in C++, perl, python, etc. 
Pig needs to allow users to program in their language of preference. 

This requires that pig provide APIs so that users can write code in these other 
languages and use it as part of their query in a way similar to what is
done with java functions today.  Java functions can be implemented as eval 
functions, filter functions, or load and storage functions.  At this point
there is not a perceived need for load and storage functions in languages 
beyond java.  But eval and filter functions should be supported.

Perl, python, and C++ have been chosen as the languages to be supported because 
those are the languages most used by the user community at this time.

'''Rationale''':  Goal 4.

'''Category''':  Infrastructure.

'''Requestor''':  Everyone.

'''Depends On''':  

'''Current Situation''':  Pig currently supports user defined functions in java.

'''Estimated Development Effort''':  Medium.

'''Urgency''':  High.

-----
-----
[[Anchor(Addenda)]]
== Addenda  ==
The following thoughts were added after the roadmap had been published and 
reviewed.

[[Anchor(Execution_that_supports_restart)]]
=== Execution that supports restart ===

We discussed the need for something that takes a Pig execution plan and 
executes it in a way the user can easily observe and comprehend.  If a job 
breaks, the user gets an HDFS directory with all of the completed sub-job state and enough info
that the user can repair his Pig script or the data and restart.  This is 
needed for
really long jobs.

[[Anchor(Global_Metadata)]]
=== Global Metadata ===
Need to consider where/how/if pig will store global metadata.  Items that
would go here would include table->file mapping (to support such things as
show tables), UDFs available on the cluster, UDTs available on the cluster,
etc.
