[Pig Wiki] Update of GettingStarted by JeffHammerbacher
Dear Wiki user,

You have subscribed to a wiki page or wiki category on Pig Wiki for change notification.

The following page has been changed by JeffHammerbacher:
http://wiki.apache.org/pig/GettingStarted

------------------------------------------------------------------------------
   4. To run pig programs, you need access to a '''Hadoop cluster''': [http://lucene.apache.org/hadoop/]. It's also possible to run pig in local mode, with severely limited performance - this mode doesn't require setting up a Hadoop cluster.

  == Building Pig ==
+ 
+ [Jeff] This needs to be changed, since 0.15 clusters are not supported any longer
   1. Check out pig code from svn: `svn co http://svn.apache.org/repos/asf/incubator/pig/trunk`.
   2. Build the code from the top directory: `ant`. If the build is successful, you should see `pig.jar` created in that directory.
[Pig Wiki] Update of PigMetaData by AlanGates
The following page has been changed by AlanGates:
http://wiki.apache.org/pig/PigMetaData

New page:
= Using Metadata in Pig =
One of the guiding philosophies of pig is that Pigs eat anything. While this is true and will remain true, this does not preclude pig from selecting better food when it is available. In this vein, pig should make use of metadata when it is available, but continue to work well in situations where it is not available.

This wiki is written assuming the functionality of the pipeline rework [http://issues.apache.org/jira/browse/PIG-157]. This has not yet been committed to the trunk, but should be sometime in July 2008.

== Definition of Metadata ==
For the purpose of this discussion, metadata will be divided into two categories, global and file specific. Global metadata records information about the system as a whole. File specific metadata records information about a particular file, or possibly a set of files in a directory. This can include schema information, histograms, etc.

== Pig Interface to File Specific Metadata ==
Pig should support four options with regard to file specific metadata:

 1. No file specific metadata available. Pig uses the file as input with no knowledge of its content. All data is assumed to be !ByteArrays.
 2. User provides schema in the script. For example, `A = load 'myfile' as (a: chararray, b: int);`.
 3. Self describing data. Data may be in a format that describes the schema, such as JSON. Users may also have other proprietary ways to store information about the data in a file, either in the file itself or in an associated file. Changes to the !LoadFunc interface made as part of the pipeline rework support this for data type and column layout only. It will need to be expanded to support other types of information about the file.
 4. Input from a data catalog. Pig needs to be able to query an external data catalog to acquire information about a file. All the same information available in option 3 should be available via this interface. This interface does not yet exist and needs to be designed.

== Pig Interface to Global Metadata ==
An interface will need to be designed for pig to communicate with an external data catalog.

== Architecture of Pig Interface to External Data Catalog ==
Pig needs to be able to connect to various types of external data catalogs (databases, catalogs stored in flat files, web services, etc.). To facilitate this, pig will develop a generic interface that allows it to make specific types of queries to a data catalog. Drivers will then need to be written to implement that interface and connect to a specific type of data catalog.

== Types of File Specific Metadata Pig Will Use ==
Pig should be able to acquire the following types of information about a file via either self description or an external data catalog. This is not to say that every self describing file or external data catalog must support every one of these items. This is a list of items pig may find useful and should be able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.

 * Field layout (already supported)
 * Field types (already supported)
 * Sortedness of the data, both key and direction (ascending/descending)
 * How file is partitioned, both partition field and hashing function
 * Number of records
 * File size
 * Cardinality of a given field
 * Histogram of values in a given field
 * Whether a field allows NULLs
 * Default values for a field

Others?

== Type of Global Metadata Pig Will Use ==
Pig should be able to acquire the following types of global information from an external data catalog. This is not to say that every external data catalog must support every one of these items. This is a list of items pig may find useful and should be able to query for. If the metadata source cannot provide the information, pig will simply not make use of it.

 * System resources available (not clear, we may be wandering too close to scheduler functionality here)

== Priorities ==
Given that the usage for global metadata is unclear, the priority will be placed on supporting file specific metadata. The first step should be to define the interface changes in !LoadFunc and the interface to external data catalogs.
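To make the driver idea on this page a little more concrete, here is a minimal sketch in Java. It is only an illustration under stated assumptions: the page says the interface does not yet exist, so `FileMetadataSource`, `InMemoryCatalog`, `CatalogDemo`, and every method name below are hypothetical and are not part of the Pig code base. Each query returns an `Optional`, so a source that cannot provide an item returns nothing and the caller falls back to its default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical generic interface: one query per metadata item.
// An empty Optional means "this source cannot provide the item".
interface FileMetadataSource {
    Optional<Long> recordCount(String file);
    Optional<Long> fileSize(String file);
    Optional<String> sortKey(String file);
}

// Hypothetical driver backed by an in-memory map, standing in for a
// driver that would talk to a database, flat file, or web service.
class InMemoryCatalog implements FileMetadataSource {
    private final Map<String, Long> counts = new HashMap<>();

    void putRecordCount(String file, long count) {
        counts.put(file, count);
    }

    @Override
    public Optional<Long> recordCount(String file) {
        return Optional.ofNullable(counts.get(file));
    }

    @Override
    public Optional<Long> fileSize(String file) {
        return Optional.empty(); // this driver does not track file sizes
    }

    @Override
    public Optional<String> sortKey(String file) {
        return Optional.empty(); // nor sortedness
    }
}

class CatalogDemo {
    // Use the catalog's record count when available, else a caller-supplied default.
    static long estimateRecords(FileMetadataSource source, String file, long fallback) {
        return source.recordCount(file).orElse(fallback);
    }

    public static void main(String[] args) {
        InMemoryCatalog catalog = new InMemoryCatalog();
        catalog.putRecordCount("myfile", 42L);
        System.out.println(estimateRecords(catalog, "myfile", 1000L));
        System.out.println(estimateRecords(catalog, "unknownfile", 1000L));
    }
}
```

One reason to prefer optional results over exceptions in a sketch like this is that missing metadata is the normal case, not an error: the page requires pig to keep working when nothing is known about a file.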
svn commit: r659689 - in /incubator/pig/trunk: ./ src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/ src/org/apache/pig/data/ src/org/apache/pig/impl/logicalLayer/parser/ test/org/apache
Author: olga
Date: Fri May 23 15:18:16 2008
New Revision: 659689

URL: http://svn.apache.org/viewvc?rev=659689&view=rev
Log:
PIG-85: allowing control characters as field delimiters in PigStorage

Modified:
    incubator/pig/trunk/CHANGES.txt
    incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/MapReduceLauncher.java
    incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigOutputFormat.java
    incubator/pig/trunk/src/org/apache/pig/data/Tuple.java
    incubator/pig/trunk/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
    incubator/pig/trunk/test/org/apache/pig/test/TestStore.java

Modified: incubator/pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/CHANGES.txt?rev=659689&r1=659688&r2=659689&view=diff
==============================================================================
--- incubator/pig/trunk/CHANGES.txt (original)
+++ incubator/pig/trunk/CHANGES.txt Fri May 23 15:18:16 2008
@@ -297,3 +297,5 @@
 PIG-236: Fix properties so that values specified via the command line (-D) are not ignored (pkamath via gates).
 PIG-198: integration with hadoop 17
+
+PIG-85: allowing control characters as delimiters for PigStorage

Modified: incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/MapReduceLauncher.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/MapReduceLauncher.java?rev=659689&r1=659688&r2=659689&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/MapReduceLauncher.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/MapReduceLauncher.java Fri May 23 15:18:16 2008
@@ -216,7 +216,8 @@
         conf.set("pig.inputs", ObjectSerializer.serialize(pom.inputFileSpecs));
         conf.setOutputPath(new Path(pom.outputFileSpec.getFileName()));
-        conf.set("pig.storeFunc", pom.outputFileSpec.getFuncSpec());
+        conf.set("pig.storeFunc",
+            ObjectSerializer.serialize(pom.outputFileSpec.getFuncSpec()));
         // Setup the DistributedCache for this job
         setupDistributedCache(pom.pigContext, conf, pom.properties,

Modified: incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigOutputFormat.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigOutputFormat.java?rev=659689&r1=659688&r2=659689&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigOutputFormat.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapreduceExec/PigOutputFormat.java Fri May 23 15:18:16 2008
@@ -35,6 +35,7 @@
 import org.apache.pig.builtin.PigStorage;
 import org.apache.pig.data.Tuple;
 import org.apache.pig.impl.PigContext;
+import org.apache.pig.impl.util.ObjectSerializer;
 import org.apache.tools.bzip2r.BZip2Constants;
 import org.apache.tools.bzip2r.CBZip2OutputStream;
@@ -51,7 +52,7 @@
     public PigRecordWriter getRecordWriter(FileSystem fs, JobConf job, Path outputDir,
             String name, Progressable progress) throws IOException {
         StoreFunc store;
-        String storeFunc = job.get("pig.storeFunc", "");
+        String storeFunc = (String) ObjectSerializer.deserialize(job.get("pig.storeFunc", "")) ;
         if (storeFunc.length() == 0) {
             store = new PigStorage();
         } else {

Modified: incubator/pig/trunk/src/org/apache/pig/data/Tuple.java
URL: http://svn.apache.org/viewvc/incubator/pig/trunk/src/org/apache/pig/data/Tuple.java?rev=659689&r1=659688&r2=659689&view=diff
==============================================================================
--- incubator/pig/trunk/src/org/apache/pig/data/Tuple.java (original)
+++ incubator/pig/trunk/src/org/apache/pig/data/Tuple.java Fri May 23 15:18:16 2008
@@ -69,18 +69,32 @@
      *
      * @param textLine
      *            the line containing fields of data
-     * @param delimiter
-     *            a regular expression of the form specified by String.split(). If null, the default
-     *            delimiter [,\t] will be used.
+     * @param delimiter
+     *            the delimiter (normal string, NO REGEX!!)
      */
     public Tuple(String textLine, String delimiter) {
         if (delimiter == null) {
             delimiter = defaultDelimiter;
         }
-        String[] splitString = textLine.split(delimiter, -1);
-        fields = new ArrayList<Datum>(splitString.length);
-        for (int i = 0; i < splitString.length; i++) {
-            fields.add(new DataAtom(splitString[i]));
+
+
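The Tuple.java hunk is cut off before the new constructor body, so the committed replacement for `textLine.split(delimiter, -1)` is not visible here. As a rough illustration of what "normal string, NO REGEX" splitting can look like, here is a minimal sketch. It is not the code from r659689; `LiteralSplit` and `splitLiteral` are hypothetical names. It keeps trailing empty fields, which is what the old `split(delimiter, -1)` call did.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: splits on a literal delimiter string.
class LiteralSplit {
    static List<String> splitLiteral(String textLine, String delimiter) {
        if (delimiter.isEmpty()) {
            // An empty delimiter would never advance the scan below.
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> fields = new ArrayList<>();
        int start = 0;
        int idx;
        // indexOf treats the delimiter as plain text, so characters such as
        // '|' or control characters like \u0001 need no escaping.
        while ((idx = textLine.indexOf(delimiter, start)) >= 0) {
            fields.add(textLine.substring(start, idx));
            start = idx + delimiter.length();
        }
        // Whatever follows the last delimiter (possibly empty) is the final field.
        fields.add(textLine.substring(start));
        return fields;
    }
}
```

With a regex-based split, a delimiter such as `|` is read as regex alternation instead of a literal character, which is the kind of surprise a literal scan avoids.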
svn commit: r659695 - in /incubator/pig/branches/types: src/org/apache/pig/FilterFunc.java src/org/apache/pig/builtin/IsEmpty.java src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt test/org/
Author: gates
Date: Fri May 23 15:31:28 2008
New Revision: 659695

URL: http://svn.apache.org/viewvc?rev=659695&view=rev
Log:
PIG-159 Santhosh's fix to bug that prevented instantiation of UDFs.

Modified:
    incubator/pig/branches/types/src/org/apache/pig/FilterFunc.java
    incubator/pig/branches/types/src/org/apache/pig/builtin/IsEmpty.java
    incubator/pig/branches/types/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
    incubator/pig/branches/types/test/org/apache/pig/test/TestLogicalPlanBuilder.java

Modified: incubator/pig/branches/types/src/org/apache/pig/FilterFunc.java
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/src/org/apache/pig/FilterFunc.java?rev=659695&r1=659694&r2=659695&view=diff
==============================================================================
--- incubator/pig/branches/types/src/org/apache/pig/FilterFunc.java (original)
+++ incubator/pig/branches/types/src/org/apache/pig/FilterFunc.java Fri May 23 15:31:28 2008
@@ -22,19 +22,7 @@
 import org.apache.pig.data.Tuple;
-public abstract class FilterFunc {
-
-    /**
-     * This callback method must be implemented by all subclasses. This
-     * is the method that will be invoked on every Tuple of a given dataset.
-     * Since the dataset may be divided up in a variety of ways the programmer
-     * should not make assumptions about state that is maintained between
-     * invocations of this method.
-     *
-     * @param input the Tuple to be processed.
-     * @throws IOException
-     */
-    abstract public boolean exec(Tuple input) throws IOException;
+public abstract class FilterFunc extends EvalFunc<Boolean> {
     /**
      * Placeholder for cleanup to be performed at the end. User defined functions can override.

Modified: incubator/pig/branches/types/src/org/apache/pig/builtin/IsEmpty.java
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/src/org/apache/pig/builtin/IsEmpty.java?rev=659695&r1=659694&r2=659695&view=diff
==============================================================================
--- incubator/pig/branches/types/src/org/apache/pig/builtin/IsEmpty.java (original)
+++ incubator/pig/branches/types/src/org/apache/pig/builtin/IsEmpty.java Fri May 23 15:31:28 2008
@@ -30,7 +30,7 @@
 public class IsEmpty extends FilterFunc {
     @Override
-    public boolean exec(Tuple input) throws IOException {
+    public Boolean exec(Tuple input) throws IOException {
         try {
             Object values = input.get(0);
             if (values instanceof DataBag)

Modified: incubator/pig/branches/types/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt?rev=659695&r1=659694&r2=659695&view=diff
==============================================================================
--- incubator/pig/branches/types/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt (original)
+++ incubator/pig/branches/types/src/org/apache/pig/impl/logicalLayer/parser/QueryParser.jjt Fri May 23 15:31:28 2008
@@ -48,6 +48,7 @@
 import org.apache.pig.data.Tuple;
 import org.apache.pig.data.BagFactory;
 import org.apache.pig.data.DataBag;
+import org.apache.pig.EvalFunc;
 public class QueryParser {
@@ -2048,13 +2049,11 @@
         if(null == userFunc) {
             //TODO
             //Commented out the code for instaniateFunc as it's failing
-            /*
             try{
-                LOUserFunc ef = (LOUserFunc) pigContext.instantiateFuncFromAlias(funcName);
+                EvalFunc ef = (EvalFunc) pigContext.instantiateFuncFromAlias(funcName);
             }catch (Exception e){
                 throw new ParseException(e.getMessage());
             }
-            */
         }
         log.trace("Exiting EvalFunction");

Modified: incubator/pig/branches/types/test/org/apache/pig/test/TestLogicalPlanBuilder.java
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/test/org/apache/pig/test/TestLogicalPlanBuilder.java?rev=659695&r1=659694&r2=659695&view=diff
==============================================================================
--- incubator/pig/branches/types/test/org/apache/pig/test/TestLogicalPlanBuilder.java (original)
+++ incubator/pig/branches/types/test/org/apache/pig/test/TestLogicalPlanBuilder.java Fri May 23 15:31:28 2008
@@ -35,7 +35,7 @@
 import org.apache.pig.LoadFunc;
 //TODO
 //Not able to include PigServer.java
-//import org.apache.pig.PigServer;
+import org.apache.pig.PigServer;
 import org.apache.pig.builtin.PigStorage;
 import org.apache.pig.data.DataBag;
 import org.apache.pig.data.Tuple;
@@ -72,7 +72,7 @@
         buildPlan(query);
     }
-    /* TODO FIX
+    // TODO FIX Query3 and Query4
     @Test
     public void testQuery3() {
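The change of `IsEmpty.exec` from `boolean` to `Boolean` follows from `FilterFunc` now extending `EvalFunc<Boolean>`: a method that overrides a generic `T exec(...)` with `T = Boolean` has to return the boxed type. The sketch below illustrates this with simplified stand-in classes. `SketchEvalFunc`, `SketchFilterFunc`, and `SketchIsEmpty` are hypothetical, a `List<Object>` stands in for Pig's `Tuple` and `DataBag`, and an unchecked exception replaces `IOException` for brevity; none of this is the actual Pig code.

```java
import java.util.List;

// Stand-in for a generic evaluation function; T is the result type.
abstract class SketchEvalFunc<T> {
    abstract T exec(List<Object> input);
}

// A filter is just an evaluation function whose result type is Boolean,
// so it no longer needs to declare its own exec method.
abstract class SketchFilterFunc extends SketchEvalFunc<Boolean> {
}

class SketchIsEmpty extends SketchFilterFunc {
    @Override
    Boolean exec(List<Object> input) {
        // Declaring the primitive 'boolean' here would not compile, because
        // the overridden method returns the type parameter T = Boolean.
        Object values = input.get(0);
        if (values instanceof List) {
            return ((List<?>) values).isEmpty();
        }
        String type = (values == null) ? "null" : values.getClass().getName();
        throw new IllegalArgumentException("Cannot test a " + type + " for emptiness");
    }
}
```

Because the filter is now an ordinary `SketchEvalFunc<Boolean>`, code that instantiates user functions can treat filters and other functions uniformly, which appears to be the point of the parser change in this commit.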
svn commit: r659721 - in /incubator/pig/branches/types: build.xml src/org/apache/pig/PigServer.java
Author: gates
Date: Fri May 23 16:58:27 2008
New Revision: 659721

URL: http://svn.apache.org/viewvc?rev=659721&view=rev
Log:
Fixes in PigServer to deal with change of alias from map string-logicalplan to logicaloperator-logicalplan. Added TestMapReduce to build.xml to begin end to end testing. It currently fails, but I'm leaving it in so we can test with it.

Modified:
    incubator/pig/branches/types/build.xml
    incubator/pig/branches/types/src/org/apache/pig/PigServer.java

Modified: incubator/pig/branches/types/build.xml
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/build.xml?rev=659721&r1=659720&r2=659721&view=diff
==============================================================================
--- incubator/pig/branches/types/build.xml (original)
+++ incubator/pig/branches/types/build.xml Fri May 23 16:58:27 2008
@@ -222,47 +222,6 @@
             <batchtest fork="yes" todir="${test.log.dir}" unless="testcase">
                 <fileset dir="test">
-                    <!--
-                    <include name="**/TestBuiltin.java" />
-                    <include name="**/TestOperatorPlan.java" />
-                    <include name="**/TestPhyOp.java" />
-                    <include name="**/TestConstExpr.java" />
-                    <include name="**/TestProject.java" />
-                    <include name="**/TestFilter.java" />
-                    <include name="**/TestAdd.java" />
-                    <include name="**/TestSubtract.java" />
-                    <include name="**/TestMultiply.java" />
-                    <include name="**/TestDivide.java" />
-                    <include name="**/TestMod.java" />
-                    <include name="**/TestGreaterThan.java" />
-                    <include name="**/TestGTOrEqual.java" />
-                    <include name="**/TestLessThan.java" />
-                    <include name="**/TestLTOrEqual.java" />
-                    <include name="**/TestEqualTo.java" />
-                    <include name="**/TestNotEqualTo.java" />
-                    <include name="**/TestPOGenerate.java" />
-                    <include name="**/TestPOSort.java" />
-                    <include name="**/TestPOUserFunc.java" />
-                    <include name="**/TestPODistinct.java" />
-                    <include name="**/TestLoad.java" />
-                    <include name="**/TestStore.java" />
-                    <include name="**/TestPackage.java" />
-                    <include name="**/TestLocalRearrange.java" />
-                    <include name="**/TestForEach.java" />
-                    <include name="**/TestUnion.java" />
-                    <include name="**/TestMRCompiler.java" />
-                    <include name="**/TestJobSubmission.java" />
-                    <include name="**/TestInputOutputFileValidator.java" />
-                    <include name="**/TestTypeCheckingValidator.java" />
-                    <include name="**/TestSchema.java" />
-                    <include name="**/TestLogicalPlanBuilder.java" />
-                    <include name="**/TestLocalJobSubmission.java" />
-                    <include name="**/TestPOMapLookUp.java" />
-                    <include name="**/TestPOBinCond.java" />
-                    <include name="**/TestPONegative.java" />
-                    <include name="**/TestGrunt.java" />
-                    <include name="**/TestPOCast.java" />
-                    -->
                     <include name="**/*Test*.java" />
                     <!-- Excluced because they are end-to-end, don't work yet. -->
                     <exclude name="**/TestAlgebraicEval.java" />
@@ -272,7 +231,7 @@
                     <exclude name="**/TestFilterOpNumeric.java" />
                     <exclude name="**/TestFilterOpString.java" />
                     <exclude name="**/TestInfixArithmetic.java" />
-                    <exclude name="**/TestMapReduce.java" />
+                    <!-- <exclude name="**/TestMapReduce.java" /> -->
                     <exclude name="**/TestPigFile.java" />
                     <exclude name="**/TestPigSplit.java" />
                     <exclude name="**/TestStoreOld.java" />

Modified: incubator/pig/branches/types/src/org/apache/pig/PigServer.java
URL: http://svn.apache.org/viewvc/incubator/pig/branches/types/src/org/apache/pig/PigServer.java?rev=659721&r1=659720&r2=659721&view=diff
==============================================================================
--- incubator/pig/branches/types/src/org/apache/pig/PigServer.java (original)
+++ incubator/pig/branches/types/src/org/apache/pig/PigServer.java Fri May 23 16:58:27 2008
@@ -231,17 +231,13 @@
     }
     public void dumpSchema(String alias) throws IOException{
-        LogicalPlan lp = aliases.get(alias);
-        if (lp == null)
-            throw new IOException("Invalid alias - " + alias);
-
         try {
+            LogicalPlan lp = getPlanFromAlias(alias, "describe");
             Schema schema = lp.getLeaves().get(0).getSchema();