Dear Wiki user,
You have subscribed to a wiki page or wiki category on Pig Wiki for change
notification.
The following page has been changed by CorinneC:
http://wiki.apache.org/pig/FAQ
--
- Pig FAQ
+ '''1. I'm using `PigStorage` to parse my input files. Can I make it use
control characters as delimiters?'''
- 1. I'm using PigStorage to parse my input files. Can I make it use control
characters as delimiters?
+ Yes. The first parameter to `PigStorage` is the dataset name, the second is a
regular expression that describes the delimiter. Pig uses `String.split(regex, -1)`
to extract fields from each line. See
[http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html
java.util.regex.Pattern] for more information on how to use special
characters in a regex. For example,
- A. Yes. Examples: PigStorage('\u0001') for Ctrl+A or '\u007C' for this
character: |
+ {{{
+ LOAD 'input.dat' USING PigStorage('\u0001');
+ }}}
- 2. Can I do a numerical comparison while filtering?
+ will use `^A` as a delimiter.
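
The splitting behaviour described above can be sketched in plain Java. This is only an illustration of `String.split(regex, -1)` with a Ctrl+A delimiter, not Pig's actual loader code; the class and method names (`DelimiterSplitSketch`, `splitFields`) are hypothetical.

```java
import java.util.Arrays;

// Illustrative sketch only: mimics the split call the FAQ mentions.
// DelimiterSplitSketch and splitFields are hypothetical names, not Pig APIs.
final class DelimiterSplitSketch {

    // Splits one input line on the given regex.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] splitFields(String line, String delimiterRegex) {
        return line.split(delimiterRegex, -1);
    }

    public static void main(String[] args) {
        // Two Ctrl+A delimiters, so three fields; the last field is empty.
        String line = "alice\u000142\u0001";
        String[] fields = splitFields(line, "\u0001");
        System.out.println(fields.length); // prints 3
        System.out.println(Arrays.toString(fields));
    }
}
```

With a limit of `0` instead of `-1`, Java would discard the trailing empty field, which is presumably why `-1` is used: every record keeps the same number of columns.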
- A. Yes, you can choose between numerical and string comparison. For numerical
comparison use the operators =, , etc. and for string comparisons use eq,
neq etc.
+ '''2. Can I do a numerical comparison while filtering?'''
- 3. How do I make my jobs run on multiple machines?
+ Yes, you can choose between numerical and string comparison. For numerical
comparison use operators such as `==` and `<`; for string comparison use `eq`,
`neq`, etc. See the format of [#CondS Conditions].
- A. Use the PARALLEL clause. For example =C = JOIN A by url, B by url PARALLEL
50=
+ '''3. How do I make my jobs run on multiple machines?'''
- 4. Does Pig support NULLs?
+ Use the `PARALLEL` clause:
- A. Pig currently has no support for NULL values but it is on the roadmap.
+ {{{
+ C = JOIN A by url, B by url PARALLEL 50;
+ }}}
- 5. Does pig support regular expressions?
+ '''4. I would like to use Pig to read a list of `.gz` files that use
`'\u0001'` as a delimiter. How do I do that?'''
- A. Pig does support regular expression matching via =matches= keyward. Tt
uses java.util.regexp matches which means your pattern has to match the entire
string (ie if your string is hi fred and you want to find fred you have to
give a pattern of .*fred not fred).
+ You can use the following load command:
+ {{{
+ LOAD 'input_file' USING PigStorage('\u0001');
+ }}}
+
+ '''5. Does Pig support NULLs?'''
+
+ Pig currently has no support for NULL values but it is on the roadmap.
+
+ '''6. Does Pig support regular expressions?'''
+
+ Pig does support regular expression matching via the `matches` keyword. It
uses
[http://java.sun.com/javase/6/docs/api/java/util/regex/package-summary.html
java.util.regex] matches which means your pattern has to match the entire
string (e.g. if your string is `hi fred` and you want to find `fred` you
have to give a pattern of `.*fred` not `fred`).
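
The whole-string rule can be seen directly in `java.util.regex`. The sketch below contrasts a full match with a substring search. It illustrates the Java behaviour the answer refers to, not Pig internals, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: WholeStringMatchSketch is a hypothetical name.
final class WholeStringMatchSketch {

    // True only if the regex matches the entire value (the "matches" semantics).
    static boolean matchesWhole(String value, String regex) {
        return Pattern.matches(regex, value);
    }

    // True if the regex matches anywhere inside the value (substring search).
    static boolean containsMatch(String value, String regex) {
        return Pattern.compile(regex).matcher(value).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("hi fred", "fred"));   // prints false
        System.out.println(matchesWhole("hi fred", ".*fred")); // prints true
        System.out.println(containsMatch("hi fred", "fred"));  // prints true
    }
}
```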
+
- 6. How to prevent failure if some records don't have the needed number of
columns.
+ '''7. How do I prevent failure if some records don't have the needed number
of columns?'''
You can filter away those records by including the following in your Pig
program:
-
+ {{{
- A = load 'foo' using PigStorage('\t');
+ A = LOAD 'foo' USING PigStorage('\t');
  B = FILTER A BY ARITY(*) >= 5;
+ }}}
+ This code would drop all records that have fewer than five (5) columns.
- This code would drop all the records that has less than 5 columns.
+ '''8. Is there any difference between `==` and `eq` for numeric
comparisons?'''
- 7. Is there any difference between == and eq for numeric comparisons?
+ There is no difference when using integers. However, `11.0` and `11` will be
equal with `==` but not with `eq`.
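
The difference can be illustrated in Java by comparing the same two values once as numbers and once as strings. This is a sketch of the general idea only, under the assumption that `eq` compares the textual form; it does not show how Pig implements either operator, and the names are hypothetical.

```java
// Illustrative sketch only: NumericVsStringSketch is a hypothetical name.
final class NumericVsStringSketch {

    // Numeric comparison: both values are parsed as numbers first.
    static boolean numericEquals(String a, String b) {
        return Double.parseDouble(a) == Double.parseDouble(b);
    }

    // String comparison: the characters themselves must be identical.
    static boolean stringEquals(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(numericEquals("11.0", "11")); // prints true
        System.out.println(stringEquals("11.0", "11"));  // prints false
        System.out.println(stringEquals("11", "11"));    // prints true
    }
}
```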
- For equality, there is no difference while you stay in integers. However 11.0
and 11 will be equal with == but not with eq.
+ '''9. Is it possible to use Pig with a regular Hadoop cluster (not HOD)?'''
+ Yes. Set the `hod.server` property to the empty string:
+
+ {{{
+ hod.server=
+ }}}
+
- 8. Is there an easy way for me to figure out how many rows exists in a
dataset from its alias?
+ '''10. Is there an easy way for me to figure out how many rows exist in a
dataset from its alias?'''
You can run the following set of commands:
+ {{{
+ a = LOAD 'bla' ... ;
+ b = GROUP a ALL;
+ c = FOREACH b GENERATE COUNT(a.$0);
+ }}}
- a = load 'bla' ... ;
+ This is equivalent to `SELECT COUNT(*)` in SQL.
- b = group a all;
+ '''11. Does Pig allow grouping on expressions?'''
- c = foreach b generate COUNT(a.$0);
+ Currently, Pig only allows grouping on data fields rather than expressions.
Allowing grouping on expressions is on our roadmap. Stay tuned!
+ '''12. Is there a way to check if a map is empty?'''
- This is equivalent to select count(*) in SQL.
+ Currently, there is no way to do that.
- 9. Does Pig allow