[jira] Updated: (PIG-1533) Compression codec should be a per-store property
[ https://issues.apache.org/jira/browse/PIG-1533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1533:
------------------------------

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

> Compression codec should be a per-store property
> ------------------------------------------------
>
> Key: PIG-1533
> URL: https://issues.apache.org/jira/browse/PIG-1533
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1533.patch
>
>
> The following script with multi-query optimization
> {code}
> a = load 'input';
> store a into 'outout.bz2';
> store a into 'outout2';
> {code}
> generates two .bz2 files, while only one of them should be compressed.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
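The per-store behaviour requested in PIG-1533 can be sketched outside Pig. The Python below is an illustration only, not Pig's implementation: the function names and the extension-to-codec table are assumptions made for this example. It derives the codec from each store location separately instead of from one job-wide setting.

```python
import os

# Hypothetical extension-to-codec table, for illustration only.
CODEC_BY_EXTENSION = {".bz2": "bzip2", ".bz": "bzip2", ".gz": "gzip"}


def codec_for_store(location):
    """Return the codec implied by one store location, or None for plain output."""
    _, extension = os.path.splitext(location)
    return CODEC_BY_EXTENSION.get(extension.lower())


def plan_stores(locations):
    """Decide compression independently for every store of a multi-query script."""
    return {location: codec_for_store(location) for location in locations}


# The two stores from the report: only the first should be compressed.
print(plan_stores(["outout.bz2", "outout2"]))
```

Keeping the decision keyed by store location means that merging several stores into one job cannot leak one store's codec into another.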
[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1525:
------------------------------

    Attachment: PIG-1525.patch

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1525:
------------------------------

    Status: Patch Available  (was: Open)

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12895802#action_12895802 ]

Richard Ding commented on PIG-1334:
-----------------------------------

I ran the mvn-deploy target. It succeeded, and the pig jar and other artifacts were deployed to
{code}
https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/
{code}
Giri, can you review the new patch?

> Make pig artifacts available through maven
> ------------------------------------------
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch
>
>
[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896094#action_12896094 ]

Richard Ding commented on PIG-1525:
-----------------------------------

Results of running test-patch:
{code}
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 3 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Commented: (PIG-103) Shared Job /tmp location should be configurable
[ https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896102#action_12896102 ]

Richard Ding commented on PIG-103:
----------------------------------

The patch looks good. A couple of comments:

* In FileLocalizer, it's better to call the getProperty
{code}
String tdir = pigContext.getProperties().getProperty("pig.temp.loc", "/tmp");
{code}
from inside of the if-block so it only gets called when needed.

* In the unit test, it would be good to verify that the method
{code}
FileLocalizer.getTemporaryPath(PigContext pigContext)
{code}
returns the correct temp directory.

> Shared Job /tmp location should be configurable
> -----------------------------------------------
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
> Reporter: Craig Macdonald
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch
>
>
> Hello,
> I'm investigating running pig in an environment where various parts of the
> file:// filesystem are available on all nodes. I can tell hadoop to use a
> file:// file system location for its default, by setting
> fs.default.name=file://path/to/shared/folder
> However, this creates issues for Pig, as Pig writes its job information in a
> folder that it assumes is a shared FS (eg DFS). However, in this scenario
> /tmp is not shared on each machine.
> So /tmp should either be configurable, or Hadoop should tell you the actual
> full location set in fs.default.name?
> Straightforward solution is to make "/tmp/" a property in
> src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
> Any suggestions of property names?
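The first review point, looking the property up only inside the branch that needs it, can be illustrated with a small Python sketch. This is not the FileLocalizer code: the function name and the `needs_temp_dir` flag are hypothetical stand-ins for whatever condition the patch checks. Only the property name `pig.temp.loc` and the `/tmp` fallback come from the comment above.

```python
def temporary_root(properties, needs_temp_dir):
    """Resolve the temp directory only when it is actually required."""
    if not needs_temp_dir:
        return None
    # The lookup sits inside the branch, so it is skipped when not needed.
    # "/tmp" is the fallback used when the property is not set.
    return properties.get("pig.temp.loc", "/tmp")


# The kind of check the second review point asks the unit test to make.
print(temporary_root({}, True))
print(temporary_root({"pig.temp.loc": "/shared/tmp"}, True))
```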
[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896121#action_12896121 ]

Richard Ding commented on PIG-1525:
-----------------------------------

It turns out that the problem also affects the conditional operator (BinCond).

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Commented: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896123#action_12896123 ]

Richard Ding commented on PIG-1525:
-----------------------------------

The cause is the interaction between Accumulator UDFs and binary operators. In the failure cases, the state kept by the Accumulator is not reset across record boundaries.

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
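To make the failure mode concrete, here is a small Python model of accumulator state leaking across record boundaries. It is a sketch under stated assumptions, not Pig's code: the class, the function, and in particular the rule that cleanup is skipped whenever the binary operator yields null are hypothetical, chosen because they reproduce the -2 from the report.

```python
class SumAccumulator:
    """Running sum that keeps its state until cleanup() is called."""

    def __init__(self):
        self.total = None

    def accumulate(self, values):
        for value in values:
            self.total = value if self.total is None else self.total + value

    def get_value(self):
        return self.total

    def cleanup(self):
        self.total = None


def diff_of_sums(groups, always_reset):
    """Compute SUM(A.val) - SUM(B.val) per group with shared accumulators."""
    sum_a, sum_b = SumAccumulator(), SumAccumulator()
    rows = []
    for key, a_vals, b_vals in groups:
        sum_a.accumulate(a_vals)
        sum_b.accumulate(b_vals)
        left, right = sum_a.get_value(), sum_b.get_value()
        has_result = left is not None and right is not None
        rows.append((key, left - right if has_result else None))
        # Assumed failure mode: cleanup is skipped when the operator yields null.
        if always_reset or has_result:
            sum_a.cleanup()
            sum_b.cleanup()
    return rows


# Cogrouped input from the report: id8 exists only in B, id9 in both.
GROUPS = [("id8", [], [1]), ("id9", [0], [1])]
print(diff_of_sums(GROUPS, always_reset=False))  # [('id8', None), ('id9', -2)]
print(diff_of_sums(GROUPS, always_reset=True))   # [('id8', None), ('id9', -1)]
```

In the first call the sum of B for id8 is never cleared, so id9 sees 1 + 1 = 2 and the difference becomes 0 - 2. Resetting after every record gives the expected -1.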
[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1525:
------------------------------

    Attachment: PIG-1525_1.patch

Thanks Thejas for suggesting a simple fix. The new patch passed core tests.

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch, PIG-1525_1.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Updated: (PIG-1525) Incorrect data generated by diff of SUM
[ https://issues.apache.org/jira/browse/PIG-1525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1525:
------------------------------

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

> Incorrect data generated by diff of SUM
> ---------------------------------------
>
> Key: PIG-1525
> URL: https://issues.apache.org/jira/browse/PIG-1525
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1525.patch, PIG-1525_1.patch
>
>
> Given data:
> input1:
> {code}
> id9 0
> {code}
> input2:
> {code}
> id8 1
> id9 1
> {code}
> Pig script
> {code}
> A = LOAD 'input1' AS (id:chararray, val:long);
> B = LOAD 'input2' AS (id:chararray, val:long);
> C = COGROUP A BY id, B BY id;
> D = FOREACH C GENERATE group, SUM(B.val), SUM(A.val), (SUM(A.val) - SUM(B.val));
> dump D;
> {code}
> generates incorrect data:
> {code}
> (id8,1L,,)
> (id9,1L,0L,-2L)
> {code}
> The workaround is to replace the FOREACH statement with
> {code}
> D = FOREACH C GENERATE group, SUM(B.val) as b, SUM(A.val) as a;
> E = FOREACH D GENERATE $0, b, a, (a-b);
> {code}
[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable
[ https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-103:
-----------------------------

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

The patch is committed to the trunk. Thanks Niraj.

> Shared Job /tmp location should be configurable
> -----------------------------------------------
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
> Reporter: Craig Macdonald
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch
>
>
> Hello,
> I'm investigating running pig in an environment where various parts of the
> file:// filesystem are available on all nodes. I can tell hadoop to use a
> file:// file system location for its default, by setting
> fs.default.name=file://path/to/shared/folder
> However, this creates issues for Pig, as Pig writes its job information in a
> folder that it assumes is a shared FS (eg DFS). However, in this scenario
> /tmp is not shared on each machine.
> So /tmp should either be configurable, or Hadoop should tell you the actual
> full location set in fs.default.name?
> Straightforward solution is to make "/tmp/" a property in
> src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
> Any suggestions of property names?
[jira] Created: (PIG-1541) FR Join shouldn't match null values
FR Join shouldn't match null values
-----------------------------------

    Key: PIG-1541
    URL: https://issues.apache.org/jira/browse/PIG-1541
    Project: Pig
    Issue Type: Bug
    Affects Versions: 0.7.0
    Reporter: Richard Ding
    Assignee: Richard Ding
    Fix For: 0.8.0

Here is an example:

Data input:
{code}
1	1
	2
{code}
the script
{code}
a = load 'input';
b = load 'input';
c = join a by $0, b by $0 using 'repl';
dump c;
{code}
generates results that match null values:
{code}
(1,1,1,1)
(,2,,2)
{code}
The regular join, on the other hand, gives the correct results.
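The expected semantics can be shown with a minimal hash join in Python. This is a sketch, not Pig's replicated join implementation: the function name and the `match_nulls` switch are hypothetical and exist only to contrast the reported behaviour with the expected one.

```python
from collections import defaultdict


def replicated_join(left_rows, right_rows, match_nulls=False):
    """Hash join on the first field; the right side plays the replicated input."""
    table = defaultdict(list)
    for row in right_rows:
        if row[0] is None and not match_nulls:
            continue  # a null key never matches, so it is not indexed
        table[row[0]].append(row)

    joined = []
    for row in left_rows:
        if row[0] is None and not match_nulls:
            continue  # a null key on the probe side produces no output
        for match in table.get(row[0], []):
            joined.append(row + match)
    return joined


# The input from the report: the second row has a null first field.
ROWS = [(1, 1), (None, 2)]
print(replicated_join(ROWS, ROWS, match_nulls=True))   # reported behaviour
print(replicated_join(ROWS, ROWS, match_nulls=False))  # expected behaviour
```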
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451 ]

Richard Ding commented on PIG-1458:
-----------------------------------

The proposal is to run another map-reduce job to merge the small files before the replicated join. This additional job will be added to the MR plan at compile time.

We consider three cases of a replicated join:

# The right input is a map-only job and input files exist at compile time.
# The right input is a map-only job and input files do not exist at compile time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the property file (_pig.frjoin.merge.files.threshold_), a merge job is added between the right input job and the FR join job.

For 2., if the flag specified in the property file (_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between the right input job and the FR join job. The default value of this flag is false.

> aggregate files for replicated join
> -----------------------------------
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> We have noticed that if the smaller data in replicated join has many files,
> this puts unneeded burden on the name node. Pre-aggregating the files can
> improve the situation.
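The three cases of the proposal can be written down as a small decision function. The Python below is a sketch of the rules as stated in the comment, not the compiler change itself: the constant and parameter names are hypothetical, and the threshold is passed in explicitly because the comment gives no default for _pig.frjoin.merge.files.threshold_.

```python
MAP_ONLY_FILES_EXIST = 1    # case 1 in the proposal
MAP_ONLY_FILES_MISSING = 2  # case 2 in the proposal
MAP_REDUCE = 3              # case 3 in the proposal


def needs_merge_job(case, threshold, file_count=0, reducer_count=0,
                    optimistic=False):
    """Decide whether a merge job goes between the right input and the FR join.

    threshold stands for pig.frjoin.merge.files.threshold and optimistic for
    pig.frjoin.merge.files.optimistic, which the proposal defaults to false.
    """
    if case == MAP_ONLY_FILES_EXIST:
        return file_count > threshold
    if case == MAP_ONLY_FILES_MISSING:
        # File count is unknown at compile time, so only the flag decides.
        return not optimistic
    if case == MAP_REDUCE:
        return reducer_count > threshold
    raise ValueError("unknown case: %r" % (case,))


print(needs_merge_job(MAP_ONLY_FILES_EXIST, threshold=100, file_count=250))
print(needs_merge_job(MAP_ONLY_FILES_MISSING, threshold=100))
print(needs_merge_job(MAP_REDUCE, threshold=100, reducer_count=10))
```

With the flag left at its default, case 2 always gets a merge job, which is the conservative choice when the number of input files cannot be known at compile time.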
[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable
[ https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-103:
-----------------------------

    Tags: documentation

> Shared Job /tmp location should be configurable
> -----------------------------------------------
>
> Key: PIG-103
> URL: https://issues.apache.org/jira/browse/PIG-103
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Environment: Partially shared file:// filesystem (eg NFS)
> Reporter: Craig Macdonald
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch
>
>
> Hello,
> I'm investigating running pig in an environment where various parts of the
> file:// filesystem are available on all nodes. I can tell hadoop to use a
> file:// file system location for its default, by setting
> fs.default.name=file://path/to/shared/folder
> However, this creates issues for Pig, as Pig writes its job information in a
> folder that it assumes is a shared FS (eg DFS). However, in this scenario
> /tmp is not shared on each machine.
> So /tmp should either be configurable, or Hadoop should tell you the actual
> full location set in fs.default.name?
> Straightforward solution is to make "/tmp/" a property in
> src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
> Any suggestions of property names?
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484 ]

Richard Ding commented on PIG-1458:
-----------------------------------

For 1. and 2. above, another approach is to do nothing and rely on MultiFileInputFormat (PIG-1518) to merge small files.

> aggregate files for replicated join
> -----------------------------------
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Richard Ding
> Fix For: 0.8.0
>
>
> We have noticed that if the smaller data in replicated join has many files,
> this puts unneeded burden on the name node. Pre-aggregating the files can
> improve the situation.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
------------------------------

    Status: Patch Available  (was: Open)

> FR Join shouldn't match null values
> -----------------------------------
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>
> Here is an example:
> Data input:
> {code}
> 1	1
> 	2
> {code}
> the script
> {code}
> a = load 'input';
> b = load 'input';
> c = join a by $0, b by $0 using 'repl';
> dump c;
> {code}
> generates results that match null values:
> {code}
> (1,1,1,1)
> (,2,,2)
> {code}
> The regular join, on the other hand, gives the correct results.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
------------------------------

    Attachment: PIG-1541.patch

> FR Join shouldn't match null values
> -----------------------------------
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>
> Here is an example:
> Data input:
> {code}
> 1	1
> 	2
> {code}
> the script
> {code}
> a = load 'input';
> b = load 'input';
> c = join a by $0, b by $0 using 'repl';
> dump c;
> {code}
> generates results that match null values:
> {code}
> (1,1,1,1)
> (,2,,2)
> {code}
> The regular join, on the other hand, gives the correct results.
[jira] Commented: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897866#action_12897866 ]

Richard Ding commented on PIG-1541:
-----------------------------------

Results of test-patch:
{code}
     [exec] +1 overall.
     [exec]
     [exec]     +1 @author.  The patch does not contain any @author tags.
     [exec]
     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.
     [exec]
     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
     [exec]
     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
     [exec]
     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
     [exec]
     [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
{code}

> FR Join shouldn't match null values
> -----------------------------------
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch
>
>
> Here is an example:
> Data input:
> {code}
> 1	1
> 	2
> {code}
> the script
> {code}
> a = load 'input';
> b = load 'input';
> c = join a by $0, b by $0 using 'repl';
> dump c;
> {code}
> generates results that match null values:
> {code}
> (1,1,1,1)
> (,2,,2)
> {code}
> The regular join, on the other hand, gives the correct results.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
------------------------------

    Attachment: PIG-1541_1.patch

New patch to address the general case where the join key is a tuple.

> FR Join shouldn't match null values
> -----------------------------------
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch, PIG-1541_1.patch
>
>
> Here is an example:
> Data input:
> {code}
> 1	1
> 	2
> {code}
> the script
> {code}
> a = load 'input';
> b = load 'input';
> c = join a by $0, b by $0 using 'repl';
> dump c;
> {code}
> generates results that match null values:
> {code}
> (1,1,1,1)
> (,2,,2)
> {code}
> The regular join, on the other hand, gives the correct results.
[jira] Commented: (PIG-1448) Detach tuple from inner plans of physical operator
[ https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898450#action_12898450 ]

Richard Ding commented on PIG-1448:
-----------------------------------

+1. Looks good.

> Detach tuple from inner plans of physical operator
> --------------------------------------------------
>
> Key: PIG-1448
> URL: https://issues.apache.org/jira/browse/PIG-1448
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
> Reporter: Ashutosh Chauhan
> Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: multi_oom_filt.pig, PIG-1448.1.patch
>
>
> This is a follow-up on PIG-1446 which only addresses this general problem for
> a specific instance of For Each. In general, all the physical operators which
> can have inner plans are vulnerable to this. Few of them include
> POLocalRearrange, POFilter, POCollectedGroup etc. Need to fix all of these.
[jira] Updated: (PIG-1541) FR Join shouldn't match null values
[ https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1541:
------------------------------

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

Tests are successful. The patch is committed to the trunk.

> FR Join shouldn't match null values
> -----------------------------------
>
> Key: PIG-1541
> URL: https://issues.apache.org/jira/browse/PIG-1541
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1541.patch, PIG-1541_1.patch
>
>
> Here is an example:
> Data input:
> {code}
> 1	1
> 	2
> {code}
> the script
> {code}
> a = load 'input';
> b = load 'input';
> c = join a by $0, b by $0 using 'repl';
> dump c;
> {code}
> generates results that match null values:
> {code}
> (1,1,1,1)
> (,2,,2)
> {code}
> The regular join, on the other hand, gives the correct results.
[jira] Updated: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1392:
------------------------------

    Status: Resolved  (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

The parser bug is fixed, but the script then runs into another problem, which is tracked by PIG-1545. The workaround is to disable the secondary key optimization.

The patch is committed to the trunk.

> Parser fails to recognize valid field
> -------------------------------------
>
> Key: PIG-1392
> URL: https://issues.apache.org/jira/browse/PIG-1392
> Project: Pig
> Issue Type: Bug
> Reporter: Ankur
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: nested_parser.patch
>
>
> Using the script below, the parser fails to recognize a valid field in the
> relation and throws an error
> A = LOAD '/tmp' as (a:int, b:chararray, c:int);
> B = GROUP A BY (a, b);
> C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
> The error thrown is
> 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}}
[jira] Commented: (PIG-1392) Parser fails to recognize valid field
[ https://issues.apache.org/jira/browse/PIG-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899003#action_12899003 ]

Richard Ding commented on PIG-1392:
-----------------------------------

Thanks Niraj for fixing this issue.

> Parser fails to recognize valid field
> -------------------------------------
>
> Key: PIG-1392
> URL: https://issues.apache.org/jira/browse/PIG-1392
> Project: Pig
> Issue Type: Bug
> Reporter: Ankur
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: nested_parser.patch
>
>
> Using the script below, the parser fails to recognize a valid field in the
> relation and throws an error
> A = LOAD '/tmp' as (a:int, b:chararray, c:int);
> B = GROUP A BY (a, b);
> C = FOREACH B { bg = A.(b,c); GENERATE group, bg; } ;
> The error thrown is
> 2010-04-23 10:16:20,610 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Invalid alias: c in {group: (a: int,b: chararray),A: {a: int,b: chararray,c: int}}
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899053#action_12899053 ]

Richard Ding commented on PIG-1334:
-----------------------------------

bq. 2. This jar is 11MB and includes a bunch of dependencies, many of which are optional:

We should deploy _pig-0.8.0-SNAPSHOT-core.jar_ (which contains only Pig classes) instead of _pig-0.8.0-SNAPSHOT.jar_ (which also contains dependent jars).

> Make pig artifacts available through maven
> ------------------------------------------
>
> Key: PIG-1334
> URL: https://issues.apache.org/jira/browse/PIG-1334
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: niraj rai
> Fix For: 0.8.0
>
> Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, mvn_pig_4.patch
>
>
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1452:
------------------------------

    Attachment: PIG-1452_3.patch

I resynced the patch with the trunk, and the size of pig.jar is now about 8M.

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --------------------------------------------------------------------------
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
> Issue Type: Improvement
> Components: build
> Affects Versions: 0.8.0
> Reporter: Giridharan Kesavan
> Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from
> the lib folder.
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig
> should leverage ivy for resolving/retrieving hadoop artifacts.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1452:
------------------------------

    Attachment: PIG-1452V4.PATCH

New patch fixing the contrib projects.

> to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
> --------------------------------------------------------------------------
>
> Key: PIG-1452
> URL: https://issues.apache.org/jira/browse/PIG-1452
> Project: Pig
> Issue Type: Improvement
> Components: build
> Affects Versions: 0.8.0
> Reporter: Giridharan Kesavan
> Assignee: Giridharan Kesavan
> Fix For: 0.8.0
>
> Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, PIG-1452V4.PATCH
>
>
> Pig uses ivy for dependency management, but it still uses hadoop20.jar from
> the lib folder.
> Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig
> should leverage ivy for resolving/retrieving hadoop artifacts.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Open (was: Patch Available) > to remove hadoop20.jar from lib and use hadoop from the apache maven repo. > -- > > Key: PIG-1452 > URL: https://issues.apache.org/jira/browse/PIG-1452 > Project: Pig > Issue Type: Improvement > Components: build >Affects Versions: 0.8.0 >Reporter: Giridharan Kesavan >Assignee: Giridharan Kesavan > Fix For: 0.8.0 > > Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, > PIG-1452V4.PATCH > > > pig use ivy for dependency management. But still it uses hadoop20.jar from > the lib folder. > Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig > should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Patch Available (was: Open) > to remove hadoop20.jar from lib and use hadoop from the apache maven repo. > -- > > Key: PIG-1452 > URL: https://issues.apache.org/jira/browse/PIG-1452 > Project: Pig > Issue Type: Improvement > Components: build >Affects Versions: 0.8.0 >Reporter: Giridharan Kesavan >Assignee: Giridharan Kesavan > Fix For: 0.8.0 > > Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, > PIG-1452V4.PATCH > > > pig use ivy for dependency management. But still it uses hadoop20.jar from > the lib folder. > Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig > should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899631#action_12899631 ] Richard Ding commented on PIG-1452: --- The target "buildJar-withouthadoop" doesn't depend on hadoop20.jar so this change doesn't affect this target. > to remove hadoop20.jar from lib and use hadoop from the apache maven repo. > -- > > Key: PIG-1452 > URL: https://issues.apache.org/jira/browse/PIG-1452 > Project: Pig > Issue Type: Improvement > Components: build >Affects Versions: 0.8.0 >Reporter: Giridharan Kesavan >Assignee: Giridharan Kesavan > Fix For: 0.8.0 > > Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, > PIG-1452V4.PATCH > > > pig use ivy for dependency management. But still it uses hadoop20.jar from > the lib folder. > Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig > should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
[ https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1452: -- Status: Resolved (was: Patch Available) Resolution: Fixed > to remove hadoop20.jar from lib and use hadoop from the apache maven repo. > -- > > Key: PIG-1452 > URL: https://issues.apache.org/jira/browse/PIG-1452 > Project: Pig > Issue Type: Improvement > Components: build >Affects Versions: 0.8.0 >Reporter: Giridharan Kesavan >Assignee: Giridharan Kesavan > Fix For: 0.8.0 > > Attachments: PIG-1452.PATCH, PIG-1452_3.patch, PIG-1452V2.PATCH, > PIG-1452V4.PATCH > > > pig use ivy for dependency management. But still it uses hadoop20.jar from > the lib folder. > Now that we have the hadoop-0.20.2 artifacts available in the maven repo, pig > should leverage ivy for resolving/retrieving hadoop artifacts. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1497) Mandatory rule PartitionFilterOptimizer
[ https://issues.apache.org/jira/browse/PIG-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900100#action_12900100 ] Richard Ding commented on PIG-1497: --- Looks good. A few comments: In _PartitionFilterPushDown_: * In the _check_ method, why was the condition changed from {code} if(... || sucs.size() != 1 || ...) { {code} to {code} if(... || succeds.size() == 0 || ...) {code} * In the _transform_ method, the original code {code} // remove this filter from the plan mPlan.removeAndReconnect(loFilter); {code} is replaced by its own implementation. It seems better to also migrate _removeAndReconnect_ to the new _OperatorPlan_ since the logic there is more complicated (keeping the order of connections). * The javadoc for the class isn't migrated. * Several variables (e.g. loadFunc, loLoad, loFilter, ...) now have scope within the _PartitionFilterPushDownTransformer_ class, so it would be better to put them inside the transformer class. In addition, * All the tabs need to be removed from the files and replaced with 4 spaces. * Several unit tests now fail due to the dependency on other jiras. > Mandatory rule PartitionFilterOptimizer > --- > > Key: PIG-1497 > URL: https://issues.apache.org/jira/browse/PIG-1497 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.8.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1497-0.patch > > > Need to migrate PartitionFilterOptimizer to new logical optimizer. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1514) Migrate logical optimization rule: OpLimitOptimizer
[ https://issues.apache.org/jira/browse/PIG-1514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900376#action_12900376 ] Richard Ding commented on PIG-1514: --- Patch looks good. A couple of comments: * It would be better to refactor the graph manipulation code into a helper class so that the graph transformation routines (such as swap, insert, remove, replace, ...) can be shared by all rules. * Please remove tabs from the file. > Migrate logical optimization rule: OpLimitOptimizer > --- > > Key: PIG-1514 > URL: https://issues.apache.org/jira/browse/PIG-1514 > Project: Pig > Issue Type: Sub-task > Components: impl >Affects Versions: 0.7.0 >Reporter: Daniel Dai >Assignee: Xuefu Zhang > Fix For: 0.8.0 > > Attachments: jira-1514-0.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900518#action_12900518 ] Richard Ding commented on PIG-1334: --- The new output is at https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/ > Make pig artifacts available through maven > -- > > Key: PIG-1334 > URL: https://issues.apache.org/jira/browse/PIG-1334 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, > mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900811#action_12900811 ] Richard Ding commented on PIG-1505: --- The results of test-patch: {code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {code} I'll commit the patch after running unit tests. > support jars and scripts in dfs > --- > > Key: PIG-1505 > URL: https://issues.apache.org/jira/browse/PIG-1505 > Project: Pig > Issue Type: Improvement >Reporter: Andrew Hitchcock >Assignee: Andrew Hitchcock > Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, > pig-jars-and-scripts-from-dfs-trunk-1.patch, > pig-jars-and-scripts-from-dfs-trunk-2.patch, > pig-jars-and-scripts-from-dfs-trunk.patch > > > Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Fix Version/s: 0.8.0 Affects Version/s: 0.7.0 > support jars and scripts in dfs > --- > > Key: PIG-1505 > URL: https://issues.apache.org/jira/browse/PIG-1505 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Andrew Hitchcock >Assignee: Andrew Hitchcock > Fix For: 0.8.0 > > Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, > pig-jars-and-scripts-from-dfs-trunk-1.patch, > pig-jars-and-scripts-from-dfs-trunk-2.patch, > pig-jars-and-scripts-from-dfs-trunk.patch > > > Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1334: -- Hadoop Flags: [Reviewed] Release Note: ant mvn-install : To install the artifact to the local filesystem. ant mvn-deploy : To deploy snapshots to the apache nexus repo (looks for authentication in ~/.m2/settings.xml). ant mvn-deploy -Drepo=staging : To deploy artifacts for voting before release; this also requires authentication configured in ~/.m2/settings.xml. Deploying artifacts to the staging repository requires signing the artifacts with gpg keys; the mvn-deploy target takes care of signing the artifacts. When the mvn-deploy target is executed with -Drepo=staging, it asks for the gpg passphrase, which needs to be keyed in. Once the deployment is successful, to make the artifact available in the staging repository, log in to the staging repository and close the staging by right-clicking on the staged artifact at http://repository.apache.org was: ant mvn-install :To install artifact to the local filesystem ant mvn-deploy : To deploy snapshots to the apache nexus repo (looks for authentication in the ~/.m2/settings.xml) ant mvn-deploy -Drepo=staging :To deploy artifacts for voting before release , this also requires authentication configured in ~/.m2/settings.xml Deploying artifacts to the staging repository requires signing the artifacts with gpg keys, mvn-deploy target takes care of signing the artifacts. While executing mvn-deploy target with -Drepo=staging it would ask for gpg passphrase which need to be keyed in. 
Once the deployment is successful, to make the artifact available in the staging repository , login into the staging repository and close the staging by right clicking on the staged artifact at http:/repository.apache.org With this patch I have already uploaded artifacts to the stating repository; (only ppl with committer access would be able to view this, as the repository is not closed yet) > Make pig artifacts available through maven > -- > > Key: PIG-1334 > URL: https://issues.apache.org/jira/browse/PIG-1334 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, > mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1334: -- Status: Resolved (was: Patch Available) Resolution: Fixed The patch is committed to the trunk. Thanks Niraj for making this feature available. > Make pig artifacts available through maven > -- > > Key: PIG-1334 > URL: https://issues.apache.org/jira/browse/PIG-1334 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, > mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed All core tests passed. The patch is committed to the trunk. Thanks Andrew for contributing this feature! > support jars and scripts in dfs > --- > > Key: PIG-1505 > URL: https://issues.apache.org/jira/browse/PIG-1505 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Andrew Hitchcock >Assignee: Andrew Hitchcock > Fix For: 0.8.0 > > Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, > pig-jars-and-scripts-from-dfs-trunk-1.patch, > pig-jars-and-scripts-from-dfs-trunk-2.patch, > pig-jars-and-scripts-from-dfs-trunk.patch > > > Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1505) support jars and scripts in dfs
[ https://issues.apache.org/jira/browse/PIG-1505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1505: -- Release Note: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. (was: Pig now supports running scripts and registering jars that are stored in HDFS, Amazon S3, or other distributed file systems. Also added a -R parameter which allows users to specify properties in key=value form on the command line.) Remove -R option. In 0.8 Pig supports generic parameters such as -Dkey=value. > support jars and scripts in dfs > --- > > Key: PIG-1505 > URL: https://issues.apache.org/jira/browse/PIG-1505 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.7.0 >Reporter: Andrew Hitchcock >Assignee: Andrew Hitchcock > Fix For: 0.8.0 > > Attachments: PIG-1505-4.patch, pig-jars-and-scripts-from-dfs-3.patch, > pig-jars-and-scripts-from-dfs-trunk-1.patch, > pig-jars-and-scripts-from-dfs-trunk-2.patch, > pig-jars-and-scripts-from-dfs-trunk.patch > > > Pig can't operate on files stored in Amazon S3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901600#action_12901600 ] Richard Ding commented on PIG-1518: --- +1. The patch looks good. A few minor points: * In PigSplit, the method add(InputSplit split) is not used and can be removed. * In MapRedUtil, it would be better not to leave the debug verification code in the source code. * In PigRecordReader, the code can be simplified if the initNextRecordReader() call is moved from the constructor to the initialize() method. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > could be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
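The PIG-1518 description above asks for an input format that packs several small files into one split instead of creating one map per file. The following is only a rough sketch of that idea, not the code in PIG-1518.patch: the class name SplitCombiner, the method combine, and the maxSplitSize parameter are all hypothetical, and files are represented just by their sizes in bytes. It groups files greedily, in input order, until adding the next file would exceed the size limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the Pig source or the patch.
public class SplitCombiner {

    // Greedily packs file sizes, in input order, into groups whose total
    // stays at or below maxSplitSize. A file larger than the limit gets a
    // group of its own.
    static List<List<Long>> combine(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Close the current group if this file would push it over the limit.
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six files become three combined splits with a 64-byte limit.
        List<List<Long>> splits = combine(new long[] {10, 20, 30, 100, 5, 5}, 64);
        System.out.println(splits); // prints [[10, 20, 30], [100], [5, 5]]
    }
}
```

A real input format would also have to weigh data locality and record boundaries, which this sketch ignores.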
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901656#action_12901656 ] Richard Ding commented on PIG-1551: --- In Invoker.java, there is a typo: {code} private static final Class LONG_ARRAY_CLASS = new String[0].getClass(); {code} also in unPrimitivize method, this code seems unnecessary: {code} } else if (klass.equals(DOUBLE_ARRAY_CLASS)) { return DOUBLE_ARRAY_CLASS; {code} Otherwise the patch looks good. > Improve dynamic invokers to deal with no-arg methods and array parameters > - > > Key: PIG-1551 > URL: https://issues.apache.org/jira/browse/PIG-1551 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.8.0 > > Attachments: PIG-1551.patch > > > PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple > Java methods in a UDF, so that users don't need to create trivial wrappers if > they are ok sacrificing some speed. > This issue is to extend the set of methods that can be wrapped this way to > include methods that do not take any arguments, and methods that take arrays > of {int,long,float,double,string} as arguments. > Arrays are expected to be represented by bags in Pig. Notably, this allows > users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1560) Build target 'checkstyle' fails
Build target 'checkstyle' fails --- Key: PIG-1560 URL: https://issues.apache.org/jira/browse/PIG-1560 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Richard Ding Assignee: Giridharan Kesavan Fix For: 0.8.0 Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: 
org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
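The trace in PIG-1560 above ends in a ClassNotFoundException for org.apache.commons.logging.LogFactory, which usually means the jar providing that class is not visible to the class loader doing the lookup. As a small diagnostic sketch (the class name ClasspathCheck and the method isLoadable are hypothetical, not part of Pig or Ant), the following checks whether a class name can be resolved on the classpath of the JVM it runs in. Note the assumption: it tests the plain JVM classpath, not the classpath Ant assembles for the checkstyle task, so it only helps if it is run with the same jars.

```java
// Hypothetical diagnostic helper; not part of Pig, Ant, or Checkstyle.
public class ClasspathCheck {

    // Returns true if the named class can be found by this class's loader.
    // The class is looked up without being initialized.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError raised by a missing dependency.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults to the class missing in the trace above.
        String name = args.length > 0 ? args[0] : "org.apache.commons.logging.LogFactory";
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

Running it with the same -cp value that the failing target uses would show whether the commons-logging jar is missing from that path.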
[jira] Updated: (PIG-1560) Build target 'checkstyle' fails
[ https://issues.apache.org/jira/browse/PIG-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1560: -- Description: Stack trace: {code} /trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at 
org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} was: Stack trace: {code} /homes/rding/apache-pig/trunk/build.xml:894: java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.commons.beanutils.ConvertUtilsBean.<init>(ConvertUtilsBean.java:130) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.createBeanUtilsBean(AutomaticBean.java:73) at com.puppycrawl.tools.checkstyle.api.AutomaticBean.contextualize(AutomaticBean.java:222) at com.puppycrawl.tools.checkstyle.CheckStyleTask.createChecker(CheckStyleTask.java:372) at com.puppycrawl.tools.checkstyle.CheckStyleTask.realExecute(CheckStyleTask.java:304) at com.puppycrawl.tools.checkstyle.CheckStyleTask.execute(CheckStyleTask.java:265) at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:291) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106) at org.apache.tools.ant.Task.perform(Task.java:348) at org.apache.tools.ant.Target.execute(Target.java:390) at org.apache.tools.ant.Target.performTasks(Target.java:411) at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1360) at org.apache.tools.ant.Project.executeTarget(Project.java:1329) at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41) at org.apache.tools.ant.Project.executeTargets(Project.java:1212) at org.apache.tools.ant.Main.runBuild(Main.java:801) at org.apache.tools.ant.Main.startAnt(Main.java:218) at org.apache.tools.ant.launch.Launcher.run(Launcher.java:280) at 
org.apache.tools.ant.launch.Launcher.main(Launcher.java:109) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.tools.ant.AntClassLoader.findClassInComponents(AntClassLoader.java:1386) at org.apache.tools.ant.AntClassLoader.findClass(AntClassLoader.java:1336) at org.apache.tools.ant.AntClassLoader.loadClass(AntClassLoader.java:1074) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) ... 22 more {code} > Build target 'checkstyle' fails > --- > > Key: PIG-1560 > URL: https://issues.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557.patch The alias for load statement is missing. Add load alias to the alias list. > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Fix Version/s: 0.8.0 > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901992#action_12901992 ] Richard Ding commented on PIG-1551: --- The typo is still there: {code} private static final Class LONG_ARRAY_CLASS = new Long[0].getClass(); {code} It seems what you want is {code} private static final Class LONG_ARRAY_CLASS = new long[0].getClass(); {code} so it's consistent with other array classes. This does raise a question about array parameters: the first form applies to methods like _amethod(Long[] nums)_, while the second supports methods like _amethod(long[] nums)_. And they are not exchangeable. > Improve dynamic invokers to deal with no-arg methods and array parameters > - > > Key: PIG-1551 > URL: https://issues.apache.org/jira/browse/PIG-1551 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.8.0 > > Attachments: PIG-1551.patch, PIG_1551.2.patch > > > PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple > Java methods in a UDF, so that users don't need to create trivial wrappers if > they are ok sacrificing some speed. > This issue is to extend the set of methods that can be wrapped this way to > include methods that do not take any arguments, and methods that take arrays > of {int,long,float,double,string} as arguments. > Arrays are expected to be represented by bags in Pig. Notably, this allows > users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
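The point in the PIG-1551 comment above, that a Long[] parameter and a long[] parameter are not exchangeable, can be checked with plain Java reflection. The sketch below is illustrative only and is not the Invoker code from the patch: the class ArrayParamDemo and the methods sumPrimitive, sumBoxed, and hasMethod are hypothetical names. It shows that the two array types have distinct Class objects and that a reflective method lookup succeeds only with the exact parameter type.

```java
// Hypothetical demo; not the Invoker class from PIG-1551.
public class ArrayParamDemo {

    // Stand-in for a method like amethod(long[] nums).
    public static long sumPrimitive(long[] nums) {
        long total = 0;
        for (long n : nums) {
            total += n;
        }
        return total;
    }

    // Stand-in for a method like amethod(Long[] nums).
    public static long sumBoxed(Long[] nums) {
        long total = 0;
        for (Long n : nums) {
            total += n; // auto-unboxing
        }
        return total;
    }

    // True if this class has a public method with that name taking exactly paramType.
    static boolean hasMethod(String name, Class<?> paramType) {
        try {
            ArrayParamDemo.class.getMethod(name, paramType);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(long[].class.equals(Long[].class));       // prints false
        System.out.println(hasMethod("sumPrimitive", long[].class)); // prints true
        System.out.println(hasMethod("sumPrimitive", Long[].class)); // prints false
        System.out.println(hasMethod("sumBoxed", Long[].class));     // prints true
    }
}
```

Because reflection does not convert between boxed and primitive array types, an invoker that registers only one of the two class objects can resolve only methods declared with that exact array type.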
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902030#action_12902030 ] Richard Ding commented on PIG-1343: --- The log file is created when running in batch mode, but not in interactive mode. > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902042#action_12902042 ] Richard Ding commented on PIG-1551: --- +1. I'm fine with arrays of primitive types. I can't think of a Java method that uses an array of object Long as a parameter. > Improve dynamic invokers to deal with no-arg methods and array parameters > - > > Key: PIG-1551 > URL: https://issues.apache.org/jira/browse/PIG-1551 > Project: Pig > Issue Type: Improvement >Affects Versions: 0.8.0 >Reporter: Dmitriy V. Ryaboy >Assignee: Dmitriy V. Ryaboy > Fix For: 0.8.0 > > Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch > > > PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple > Java methods in a UDF, so that users don't need to create trivial wrappers if > they are ok sacrificing some speed. > This issue is to extend the set of methods that can be wrapped this way to > include methods that do not take any arguments, and methods that take arrays > of {int,long,float,double,string} as arguments. > Arrays are expected to be represented by bags in Pig. Notably, this allows > users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Attachment: PIG-1483_1.patch New patch adding unit test. > [piggybank] Add HadoopJobHistoryLoader to the piggybank > --- > > Key: PIG-1483 > URL: https://issues.apache.org/jira/browse/PIG-1483 > Project: Pig > Issue Type: New Feature >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1483.patch, PIG-1483_1.patch > > > PIG-1333 added many script-related entries to the MR job xml file and thus > it's now possible to use Pig for querying Hadoop job history/xml files to get > script-level usage statistics. What we need is a Pig loader that can parse > these files and generate corresponding data objects. > The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. > Here is an example that shows the intended usage: > *Find all the jobs grouped by script and user:* > {code} > a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as > (j:map[], m:map[], r:map[]); > b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) > j#'USER' as user, (Chararray) j#'JOBID' as job; > c = filter b by not (id is null); > d = group c by (id, user); > e = foreach d generate flatten(group), c.job; > dump e; > {code} > A couple more examples: > *Find scripts that use only the default parallelism:* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], > m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' > as script_name, (Long) r#'NUMBER_REDUCES' as reduces; > c = group b by (id, user, script_name) parallel 10; > d = foreach c generate group.user, group.script_name, MAX(b.reduces) as > max_reduces; > e = filter d by max_reduces == 1; > dump e; > {code} > *Find the running time of each script (in seconds):* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() 
as (j:map[], m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; > c = group b by (id, user, script_name); > d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; > dump d; > {code}
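The intent of the running-time example, as read from its "(in seconds)" heading, is (latest finish - earliest submit) / 1000. The sketch below spells out that arithmetic in plain Java, assuming the SUBMIT_TIME and FINISH_TIME values are milliseconds; the class and method names are hypothetical and not part of the patch.

```java
public class RunningTimeSketch {
    // Returns (latest finish - earliest submit) / 1000, i.e. whole seconds,
    // for per-job submit and finish times given in milliseconds (an assumption).
    public static long runningTimeSeconds(long[] submitMillis, long[] finishMillis) {
        long minStart = Long.MAX_VALUE;
        long maxEnd = Long.MIN_VALUE;
        for (long s : submitMillis) {
            minStart = Math.min(minStart, s);
        }
        for (long f : finishMillis) {
            maxEnd = Math.max(maxEnd, f);
        }
        // The parentheses matter: maxEnd - minStart / 1000 would divide only minStart.
        return (maxEnd - minStart) / 1000;
    }

    public static void main(String[] args) {
        long[] submit = {10_000L, 12_000L};
        long[] finish = {15_000L, 40_000L};
        System.out.println(runningTimeSeconds(submit, finish));
    }
}
```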
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Status: Patch Available (was: Open) > [piggybank] Add HadoopJobHistoryLoader to the piggybank > --- > > Key: PIG-1483 > URL: https://issues.apache.org/jira/browse/PIG-1483 > Project: Pig > Issue Type: New Feature >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1483.patch, PIG-1483_1.patch > > > PIG-1333 added many script-related entries to the MR job xml file and thus > it's now possible to use Pig for querying Hadoop job history/xml files to get > script-level usage statistics. What we need is a Pig loader that can parse > these files and generate corresponding data objects. > The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. > Here is an example that shows the intended usage: > *Find all the jobs grouped by script and user:* > {code} > a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as > (j:map[], m:map[], r:map[]); > b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) > j#'USER' as user, (Chararray) j#'JOBID' as job; > c = filter b by not (id is null); > d = group c by (id, user); > e = foreach d generate flatten(group), c.job; > dump e; > {code} > A couple more examples: > *Find scripts that use only the default parallelism:* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], > m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' > as script_name, (Long) r#'NUMBER_REDUCES' as reduces; > c = group b by (id, user, script_name) parallel 10; > d = foreach c generate group.user, group.script_name, MAX(b.reduces) as > max_reduces; > e = filter d by max_reduces == 1; > dump e; > {code} > *Find the running time of each script (in seconds):* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], > 
m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; > c = group b by (id, user, script_name); > d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; > dump d; > {code}
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Attachment: PIG-1557_1.patch New patch adds a unit test. > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch, PIG-1557_1.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Patch Available (was: Open) Hadoop Flags: [Reviewed] > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch, PIG-1557_1.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1557: -- Status: Resolved (was: Patch Available) Resolution: Fixed > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch, PIG-1557_1.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902952#action_12902952 ] Richard Ding commented on PIG-1564: --- Hi Andrew, HDataStorage is a thin layer on top of Hadoop FileSystem. Since Pig moved its local mode to Hadoop local mode, it no longer needs this layer. We intend to remove it in the future. As for Pig reading data from one file system and writing it to another, this has been supported since Pig 0.7. -Richard > add support for multiple filesystems > > > Key: PIG-1564 > URL: https://issues.apache.org/jira/browse/PIG-1564 > Project: Pig > Issue Type: Improvement >Reporter: Andrew Hitchcock > Attachments: PIG-1564-1.patch > > > Currently you can't run Pig scripts that read data from one file system and > write it to another. Also, Grunt doesn't support CDing from one directory to > another on different file systems.
[jira] Resolved: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1518. --- Hadoop Flags: [Reviewed] Resolution: Fixed Patch is committed to trunk. Thanks Yan. > multi file input format for loaders > --- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run into the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file, which > can be very inefficient. > It would be great to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing a similar thing: > MultifileInputFormat as well as CombinedInputFormat; however, neither works > with the new Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0.
[jira] Assigned: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1569: - Assignee: Richard Ding > java properties not honored in case of properties such as stop.on.failure > - > > Key: PIG-1569 > URL: https://issues.apache.org/jira/browse/PIG-1569 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > > In org.apache.pig.Main , properties are being set to default value without > checking if the java system properties have been set to something else. > stop.on.failure, opt.multiquery, aggregate.warning are some properties that > have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
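The behavior requested in PIG-1569 can be sketched as "apply a default only when nothing else has set the property". The code below is a minimal illustration of that idea and not the code in org.apache.pig.Main; the class name, the helper names, and the default values shown are assumptions made for the example.

```java
import java.util.Properties;

public class DefaultPropertySketch {
    // Sets key to defaultValue only when neither the given properties nor the
    // JVM system properties already define it; a system property wins over the default.
    public static void setIfAbsent(Properties props, String key, String defaultValue) {
        if (props.getProperty(key) != null) {
            return;
        }
        String fromSystem = System.getProperty(key);
        props.setProperty(key, fromSystem != null ? fromSystem : defaultValue);
    }

    // The default values below are assumptions for illustration only.
    public static Properties withDefaults(Properties props) {
        setIfAbsent(props, "stop.on.failure", "false");
        setIfAbsent(props, "opt.multiquery", "true");
        setIfAbsent(props, "aggregate.warning", "true");
        return props;
    }

    public static void main(String[] args) {
        // Simulates launching the JVM with -Dstop.on.failure=true.
        System.setProperty("stop.on.failure", "true");
        Properties props = withDefaults(new Properties());
        System.out.println("stop.on.failure=" + props.getProperty("stop.on.failure"));
        System.out.println("opt.multiquery=" + props.getProperty("opt.multiquery"));
    }
}
```

With this shape, a value passed as -Dstop.on.failure=true on the java command line survives, while properties that were never set still receive their defaults.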
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903072#action_12903072 ] Richard Ding commented on PIG-1343: --- The new patch logs NPE instead of the intended message: {code} [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. null {code} > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1458: -- Attachment: PIG-1458.patch This patch uses the new multi-file-combiner (PIG-1518) to concatenate many small files for replicated join. This is based on the assumption that the total size of the replicated files should be small enough to fit into main memory. > aggregate files for replicated join > --- > > Key: PIG-1458 > URL: https://issues.apache.org/jira/browse/PIG-1458 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch > > > We have noticed that if the smaller data in replicated join has many files, > this puts unneeded burden on the name node. pre-aggregating the files can > improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1483: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed > [piggybank] Add HadoopJobHistoryLoader to the piggybank > --- > > Key: PIG-1483 > URL: https://issues.apache.org/jira/browse/PIG-1483 > Project: Pig > Issue Type: New Feature >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1483.patch, PIG-1483_1.patch > > > PIG-1333 added many script-related entries to the MR job xml file and thus > it's now possible to use Pig for querying Hadoop job history/xml files to get > script-level usage statistics. What we need is a Pig loader that can parse > these files and generate corresponding data objects. > The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. > Here is an example that shows the intended usage: > *Find all the jobs grouped by script and user:* > {code} > a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as > (j:map[], m:map[], r:map[]); > b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) > j#'USER' as user, (Chararray) j#'JOBID' as job; > c = filter b by not (id is null); > d = group c by (id, user); > e = foreach d generate flatten(group), c.job; > dump e; > {code} > A couple more examples: > *Find scripts that use only the default parallelism:* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], > m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' > as script_name, (Long) r#'NUMBER_REDUCES' as reduces; > c = group b by (id, user, script_name) parallel 10; > d = foreach c generate group.user, group.script_name, MAX(b.reduces) as > max_reduces; > e = filter d by max_reduces == 1; > dump e; > {code} > *Find the running time of each script (in seconds):* > {code} > a = load '/mapred/history/done' using 
HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; > c = group b by (id, user, script_name); > d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000; > dump d; > {code}
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903523#action_12903523 ] Richard Ding commented on PIG-1343: --- I ran the above script in local mode; both batch mode and interactive mode now generate the expected result: {code} ERROR 2244: Job failed, hadoop does not return any error message {code} > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch, pig_1343_2.patch, > pig_1343_4.patch, PIG_1343_5.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj
[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904267#action_12904267 ] Richard Ding commented on PIG-1343: --- Patch is committed to the trunk. Thanks Niraj. > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, > pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1343: -- Attachment: PIG-1343_6.patch > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, > pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails
[ https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1343: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed > pig_log file missing even though Main tells it is creating one and an M/R job > fails > > > Key: PIG-1343 > URL: https://issues.apache.org/jira/browse/PIG-1343 > Project: Pig > Issue Type: Bug > Components: impl >Affects Versions: 0.6.0 >Reporter: Viraj Bhat >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: 1343.patch, PIG-1343-1.patch, PIG-1343_6.patch, > pig_1343_2.patch, pig_1343_4.patch, PIG_1343_5.patch > > > There is a particular case where I was running with the latest trunk of Pig. > {code} > $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig > [main] INFO org.apache.pig.Main - Logging error messages to: > /homes/viraj/pig_1263420012601.log > $ls -l pig_1263420012601.log > ls: pig_1263420012601.log: No such file or directory > {code} > The job failed and the log file did not contain anything, the only way to > debug was to look into the Jobtracker logs. > Here are some reasons which would have caused this behavior: > 1) The underlying filer/NFS had some issues. In that case do we not error on > stdout? > 2) There are some errors from the backend which are not being captured > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (PIG-1578) PigServer.executeBatch does not return status of failed job
[ https://issues.apache.org/jira/browse/PIG-1578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding reassigned PIG-1578: - Assignee: Richard Ding > PigServer.executeBatch does not return status of failed job > --- > > Key: PIG-1578 > URL: https://issues.apache.org/jira/browse/PIG-1578 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > > For a failed job, PigServer.executeBatch does not return an ExecJob. > ExecJobs are created using output statistics, and the output statistics for > jobs that failed do not seem to exist. > The query I tried was a native mapreduce job where the output file of the > native MR job already exists, causing that job to fail. > {code} > A = load '" + INPUT_FILE + "'; > B = mapreduce '" + jarFileName + "' " + > "Store A into 'table_testNativeMRJobSimple_input' "+ > "Load 'table_testNativeMRJobSimple_output' "+ > "`WordCount table_testNativeMRJobSimple_input " + INPUT_FILE + > "`;"); > Store B into 'table_testNativeMRJobSimpleDir';); > {code}
[jira] Commented: (PIG-1570) native mapreduce operator MR job does not follow same failure handling logic as other pig MR jobs
[ https://issues.apache.org/jira/browse/PIG-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904321#action_12904321 ] Richard Ding commented on PIG-1570: --- +1. > native mapreduce operator MR job does not follow same failure handling logic > as other pig MR jobs > - > > Key: PIG-1570 > URL: https://issues.apache.org/jira/browse/PIG-1570 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1570.1.patch > > > The code path for handling failure in MR job corresponding to native MR is > different and does not have the same behavior. > For example, even if the MR job for mapreduce operator fails, the number of > jobs that failed is being reported as 0 in PigStats log. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1458: -- Attachment: PIG-1458_1.patch New patch addressing review comments. > aggregate files for replicated join > --- > > Key: PIG-1458 > URL: https://issues.apache.org/jira/browse/PIG-1458 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch, PIG-1458_1.patch > > > We have noticed that if the smaller data in replicated join has many files, > this puts unneeded burden on the name node. pre-aggregating the files can > improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Status: Patch Available (was: Open) > java properties not honored in case of properties such as stop.on.failure > - > > Key: PIG-1569 > URL: https://issues.apache.org/jira/browse/PIG-1569 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1569.patch > > > In org.apache.pig.Main , properties are being set to default value without > checking if the java system properties have been set to something else. > stop.on.failure, opt.multiquery, aggregate.warning are some properties that > have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Attachment: PIG-1569.patch > java properties not honored in case of properties such as stop.on.failure > - > > Key: PIG-1569 > URL: https://issues.apache.org/jira/browse/PIG-1569 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1569.patch > > > In org.apache.pig.Main , properties are being set to default value without > checking if the java system properties have been set to something else. > stop.on.failure, opt.multiquery, aggregate.warning are some properties that > have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904385#action_12904385 ] Richard Ding commented on PIG-1458: --- Koji, Please open a jira on increasing the replication factor of the replicated files. Now it uses the default replication factor. Thanks, -Richard > aggregate files for replicated join > --- > > Key: PIG-1458 > URL: https://issues.apache.org/jira/browse/PIG-1458 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch, PIG-1458_1.patch > > > We have noticed that if the smaller data in replicated join has many files, > this puts unneeded burden on the name node. pre-aggregating the files can > improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1569: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed > java properties not honored in case of properties such as stop.on.failure > - > > Key: PIG-1569 > URL: https://issues.apache.org/jira/browse/PIG-1569 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1569.patch > > > In org.apache.pig.Main , properties are being set to default value without > checking if the java system properties have been set to something else. > stop.on.failure, opt.multiquery, aggregate.warning are some properties that > have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1458. --- Hadoop Flags: [Reviewed] Resolution: Fixed > aggregate files for replicated join > --- > > Key: PIG-1458 > URL: https://issues.apache.org/jira/browse/PIG-1458 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch, PIG-1458_1.patch > > > We have noticed that if the smaller data in replicated join has many files, > this puts unneeded burden on the name node. pre-aggregating the files can > improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1458) aggregate files for replicated join
[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904451#action_12904451 ] Richard Ding commented on PIG-1458: --- Patch committed to trunk. > aggregate files for replicated join > --- > > Key: PIG-1458 > URL: https://issues.apache.org/jira/browse/PIG-1458 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch, PIG-1458_1.patch > > > We have noticed that if the smaller data in replicated join has many files, > this puts unneeded burden on the name node. pre-aggregating the files can > improve the situation -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1569) java properties not honored in case of properties such as stop.on.failure
[ https://issues.apache.org/jira/browse/PIG-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904452#action_12904452 ] Richard Ding commented on PIG-1569: --- Patch committed to trunk. > java properties not honored in case of properties such as stop.on.failure > - > > Key: PIG-1569 > URL: https://issues.apache.org/jira/browse/PIG-1569 > Project: Pig > Issue Type: Bug >Reporter: Thejas M Nair >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1569.patch > > > In org.apache.pig.Main , properties are being set to default value without > checking if the java system properties have been set to something else. > stop.on.failure, opt.multiquery, aggregate.warning are some properties that > have this problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
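The behaviour requested in PIG-1569 comes down to a precedence rule: a default is applied only when the user has not already supplied a value. The following is a minimal Python sketch of that rule, not the code in org.apache.pig.Main; the function name `apply_defaults` and the default values are assumptions made for illustration.

```python
def apply_defaults(user_props, defaults):
    """Return a merged property map in which user-supplied values win over defaults."""
    merged = dict(user_props)            # start from what the user set (e.g. via -D)
    for key, value in defaults.items():
        merged.setdefault(key, value)    # fill in a default only if the key is missing
    return merged


# Hypothetical defaults; the real defaults in Pig may differ.
DEFAULTS = {
    "stop.on.failure": "false",
    "opt.multiquery": "true",
    "aggregate.warning": "true",
}

user_props = {"stop.on.failure": "true"}
props = apply_defaults(user_props, DEFAULTS)
```

Here `props["stop.on.failure"]` keeps the user's value, while the two properties the user did not set fall back to the defaults.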
[jira] Commented: (PIG-1483) [piggybank] Add HadoopJobHistoryLoader to the piggybank
[ https://issues.apache.org/jira/browse/PIG-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904453#action_12904453 ] Richard Ding commented on PIG-1483: --- Patch committed to trunk. > [piggybank] Add HadoopJobHistoryLoader to the piggybank > --- > > Key: PIG-1483 > URL: https://issues.apache.org/jira/browse/PIG-1483 > Project: Pig > Issue Type: New Feature >Reporter: Richard Ding >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1483.patch, PIG-1483_1.patch > > > PIG-1333 added many script-related entries to the MR job xml file and thus > it's now possible to use Pig for querying Hadoop job history/xml files to get > script-level usage statistics. What we need is a Pig loader that can parse > these files and generate corresponding data objects. > The goal of this jira is to create a HadoopJobHistoryLoader in piggybank. > Here is an example that shows the intended usage: > *Find all the jobs grouped by script and user:* > {code} > a = load '/mapred/history/_logs/history/' using HadoopJobHistoryLoader() as > (j:map[], m:map[], r:map[]); > b = foreach a generate (Chararray) j#'PIG_SCRIPT_ID' as id, (Chararray) > j#'USER' as user, (Chararray) j#'JOBID' as job; > c = filter b by not (id is null); > d = group c by (id, user); > e = foreach d generate flatten(group), c.job; > dump e; > {code} > A couple more examples: > *Find scripts that use only the default parallelism:* > {code} > a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], > m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' > as script_name, (Long) r#'NUMBER_REDUCES' as reduces; > c = group b by (id, user, script_name) parallel 10; > d = foreach c generate group.user, group.script_name, MAX(b.reduces) as > max_reduces; > e = filter d by max_reduces == 1; > dump e; > {code} > *Find the running time of each script (in seconds):* > {code} > a = load '/mapred/history/done' using 
HadoopJobHistoryLoader() as (j:map[], > m:map[], r:map[]); > b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' > as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as > end; > c = group b by (id, user, script_name); > d = foreach c generate group.user, group.script_name, (MAX(b.end) - > MIN(b.start))/1000; > dump d; > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
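The arithmetic in the running-time query above can be checked outside Pig. This is a small Python sketch of the same grouping and (max end - min start)/1000 computation; the sample records are made up, and it assumes, as the division by 1000 suggests, that SUBMIT_TIME and FINISH_TIME are in milliseconds.

```python
from collections import defaultdict


def script_runtimes(jobs):
    """Map (id, user, script_name) to elapsed seconds across all jobs of that script."""
    starts = defaultdict(list)
    ends = defaultdict(list)
    for job in jobs:
        key = (job["id"], job["user"], job["script_name"])
        starts[key].append(job["start"])
        ends[key].append(job["end"])
    # Same expression as the Pig script: (MAX(end) - MIN(start)) / 1000
    return {key: (max(ends[key]) - min(starts[key])) / 1000 for key in starts}


# Made-up sample: two jobs launched by the same script.
jobs = [
    {"id": "s1", "user": "alice", "script_name": "a.pig", "start": 1000, "end": 5000},
    {"id": "s1", "user": "alice", "script_name": "a.pig", "start": 4000, "end": 9000},
]
runtimes = script_runtimes(jobs)
```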
[jira] Commented: (PIG-1557) couple of issue mapping aliases to jobs
[ https://issues.apache.org/jira/browse/PIG-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904456#action_12904456 ] Richard Ding commented on PIG-1557: --- Patch committed to trunk. > couple of issue mapping aliases to jobs > --- > > Key: PIG-1557 > URL: https://issues.apache.org/jira/browse/PIG-1557 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Olga Natkovich >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1557.patch, PIG-1557_1.patch > > > I have a simple script: > A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > B = group A by name; > C = foreach B generate group, COUNT(A); > D = order C by $1; > E = limit D 10; > dump E; > I noticed a couple of issues with alias to job mapping: neither load(A) nor > limit(E) shows in the output -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1334) Make pig artifacts available through maven
[ https://issues.apache.org/jira/browse/PIG-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905744#action_12905744 ] Richard Ding commented on PIG-1334: --- Scott, Please create a new Jira for this. Another follow-up jira (PIG-1562) has already been opened. -Richard > Make pig artifacts available through maven > -- > > Key: PIG-1334 > URL: https://issues.apache.org/jira/browse/PIG-1334 > Project: Pig > Issue Type: Improvement >Reporter: Olga Natkovich >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: mvn-pig.patch, mvn_pig_2.patch, mvn_pig_3.patch, > mvn_pig_4.patch, mvn_pig_5.patch, mvn_pig_6.patch > > -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: PIG-1458.patch Results of test-patch: {code} [exec] +1 overall. [exec] [exec] +1 @author. The patch does not contain any @author tags. [exec] [exec] +1 tests included. The patch appears to include 3 new or modified tests. [exec] [exec] +1 javadoc. The javadoc tool did not generate any warning messages. [exec] [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings. [exec] [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings. [exec] [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings. {code} > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
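The consolidation step described in PIG-1548 can be pictured as concatenating every part file of the scalar output into one file that is then opened once. The sketch below is plain Python on a local temporary directory standing in for dfs; it is not the patch itself, and the names `consolidate_parts` and `scalar.merged` are assumptions.

```python
import glob
import os
import shutil
import tempfile


def consolidate_parts(src_dir, dest_path):
    """Concatenate all part-* files in src_dir into one file; return how many were merged."""
    parts = sorted(glob.glob(os.path.join(src_dir, "part-*")))
    with open(dest_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return len(parts)


# Demo: three part files, only one of which holds the single scalar record.
work = tempfile.mkdtemp()
for i, payload in enumerate([b"", b"42\n", b""]):
    with open(os.path.join(work, f"part-{i:05d}"), "wb") as f:
        f.write(payload)

merged = os.path.join(work, "scalar.merged")
n_parts = consolidate_parts(work, merged)
with open(merged, "rb") as f:
    merged_bytes = f.read()
```

A reader of the scalar now opens one file instead of three, which is the reduction in namenode requests the issue is after.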
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Status: Patch Available (was: Open) > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1458.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: PIG-1548.patch > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1548.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: (was: PIG-1458.patch) > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1548.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1543) IsEmpty returns the wrong value after using LIMIT
[ https://issues.apache.org/jira/browse/PIG-1543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906008#action_12906008 ] Richard Ding commented on PIG-1543: --- +1. Looks good. > IsEmpty returns the wrong value after using LIMIT > - > > Key: PIG-1543 > URL: https://issues.apache.org/jira/browse/PIG-1543 > Project: Pig > Issue Type: Bug >Affects Versions: 0.7.0 >Reporter: Justin Hu >Assignee: Daniel Dai > Fix For: 0.8.0 > > Attachments: PIG-1543-1.patch > > > 1. Two input files: > 1a: limit_empty.input_a > 1 > 1 > 1 > 1b: limit_empty.input_b > 2 > 2 > 2. > The pig script: limit_empty.pig > -- A contains only 1's & B contains only 2's > A = load 'limit_empty.input_a' as (a1:int); > B = load 'limit_empty.input_b' as (b1:int); > C = COGROUP A by a1, B by b1; > D = FOREACH C generate A, B, (IsEmpty(A)? 0:1), (IsEmpty(B)? 0:1), COUNT(A), > COUNT(B); > store D into 'limit_empty.output/d'; > -- After the script is done, we see the right results: > -- {(1),(1),(1)} {} 1 0 3 0 > -- {} {(2),(2)} 0 1 0 2 > C1 = foreach C { Alim = limit A 1; Blim = limit B 1; generate Alim, Blim; } > D1 = FOREACH C1 generate Alim,Blim, (IsEmpty(Alim)? 0:1), (IsEmpty(Blim)? > 0:1), COUNT(Alim), COUNT(Blim); > store D1 into 'limit_empty.output/d1'; > -- After the script is done, we see the unexpected results: > -- {(1)} {} 1 1 1 0 > -- {} {(2)} 1 1 0 1 > dump D; > dump D1; > 3. Run the script and redirect the stdout (2 dumps) to a file. There are two issues: > The major one: > IsEmpty() returns FALSE for an empty bag in limit_empty.output/d1/*, while > IsEmpty() returns the correct value in limit_empty.output/d/*. > The difference is that LIMIT has been applied to the bags in d1 before using > IsEmpty(). 
> The minor one: > The redirected output only contains the first dump: > ({(1),(1),(1)},{},1,0,3L,0L) > ({},{(2),(2)},0,1,0L,2L) > We expect two more lines like: > ({(1)},{},1,1,1L,0L) > ({},{(2)},1,1,0L,1L) > Besides, there is an error that says: > [main] ERROR org.apache.pig.backend.hadoop.executionengine.HJob - > java.lang.ClassCastException: java.lang.Integer cannot be cast to > org.apache.pig.data.Tuple -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
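For reference, the result a correct IsEmpty should give after LIMIT can be modelled in a few lines. This Python sketch imitates the COGROUP, LIMIT and IsEmpty steps of the script on plain lists; the helper names are assumptions and the model ignores everything about Pig's execution, so it only illustrates the intended semantics.

```python
def cogroup(a, b):
    """Group two lists of ints by value, returning (key, bag_a, bag_b) per distinct key."""
    keys = sorted(set(a) | set(b))
    return [(k, [x for x in a if x == k], [x for x in b if x == k]) for k in keys]


def limit(bag, n):
    """Keep at most n elements of a bag."""
    return bag[:n]


def is_empty(bag):
    return len(bag) == 0


A = [1, 1, 1]   # contents of limit_empty.input_a
B = [2, 2]      # contents of limit_empty.input_b

rows = []
for _, bag_a, bag_b in cogroup(A, B):
    alim, blim = limit(bag_a, 1), limit(bag_b, 1)
    rows.append((alim, blim,
                 0 if is_empty(alim) else 1,
                 0 if is_empty(blim) else 1,
                 len(alim), len(blim)))
```

In this model the flag for an empty limited bag is 0, which is what the report says D1 should contain instead of 1.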
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Attachment: PIG-1548_1.patch The patch excludes some multiquery cases where more information is needed to correlate and determine the files to consolidate. We'll consider those cases in a separate jira. > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1548.patch, PIG-1548_1.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1599) pig gives generic message for few cases
[ https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906153#action_12906153 ] Richard Ding commented on PIG-1599: --- I manually ran the related tests and they all passed. I'm going to check in the patch to the trunk and 0.8 branch. > pig gives generic message for few cases > --- > > Key: PIG-1599 > URL: https://issues.apache.org/jira/browse/PIG-1599 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Attachments: pig-1599_0.patch, pig-1599_1.patch > > > When we run the script: > register testudf.jar; > a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > c = cogroup a by name, b by name; > d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b)); > dump d; > we now get the error: > "ERROR 2088: Unable to get results for: > hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage". > The udf is a bad udf and it should throw: > ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, > Out of bounds access [Index: 2, Size: 2] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-1599) pig gives generic message for few cases
[ https://issues.apache.org/jira/browse/PIG-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-1599. --- Hadoop Flags: [Reviewed] Resolution: Fixed Patch is committed to both trunk and 0.8 branch. Thanks Niraj. > pig gives generic message for few cases > --- > > Key: PIG-1599 > URL: https://issues.apache.org/jira/browse/PIG-1599 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Attachments: pig-1599_0.patch, pig-1599_1.patch > > > When we run the script: > register testudf.jar; > a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > b = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); > c = cogroup a by name, b by name; > d = foreach c generate flatten(org.apache.pig.test.udf.evalfunc.BadUdf(a,b)); > dump d; > we now get the error: > "ERROR 2088: Unable to get results for: > hdfs://wilbur20.labs.corp.sp1.yahoo.com:9020/tmp/temp1787360727/tmp509618997:org.apache.pig.impl.io.InterStorage". > The udf is a bad udf and it should throw: > ERROR 2078: Caught error from UDF: org.apache.pig.test.udf.evalfunc.BadUdf, > Out of bounds access [Index: 2, Size: 2] -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1548) Optimize scalar to consolidate the part file
[ https://issues.apache.org/jira/browse/PIG-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1548: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. > Optimize scalar to consolidate the part file > > > Key: PIG-1548 > URL: https://issues.apache.org/jira/browse/PIG-1548 > Project: Pig > Issue Type: Improvement > Components: impl >Reporter: Daniel Dai >Assignee: Richard Ding > Fix For: 0.8.0 > > Attachments: PIG-1548.patch, PIG-1548_1.patch > > > The current scalar implementation writes a scalar file onto dfs. When Pig > needs the scalar, it opens the dfs file directly. Each scalar file > contains more than one part file even though it contains only one record. This > puts a huge load on the namenode. We should consolidate the part files before opening them. > Another optional step is to put the consolidated file into the distributed cache. > This further brings down the load on the namenode. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: PIG-1479.patch Thanks Julien. I rebased the patch with the latest trunk and added an option (-greek) in the Main class. Now one can run a "PIG-Greek" script with following command: {code} java -cp pig.jar:: org.apache.pig.Main -g {code} or in local mode: {code} java -cp pig.jar: org.apache.pig.Main -x local -g {code} > Embed Pig in scripting languages > > > Key: PIG-1479 > URL: https://issues.apache.org/jira/browse/PIG-1479 > Project: Pig > Issue Type: New Feature >Reporter: Julien Le Dem > Attachments: PIG-1479.patch, pig-greek.tgz > > > It should be possible to embed Pig calls in a scripting language and let > functions defined in the same script available as UDFs. > This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which > lets users define UDFs in scripting languages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1562) Fix the version for the dependent packages for the maven
[ https://issues.apache.org/jira/browse/PIG-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1562: -- Status: Resolved (was: Patch Available) Hadoop Flags: [Reviewed] Resolution: Fixed Patch committed to both trunk and 0.8 branch. Thanks Niraj!. > Fix the version for the dependent packages for the maven > - > > Key: PIG-1562 > URL: https://issues.apache.org/jira/browse/PIG-1562 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Fix For: 0.8.0 > > Attachments: PIG-1562_1.patch, PIG-1562_2.patch, PIG_1562_0.patch > > > We need to fix the set version so that, version is properly set for the > dependent packages in the maven repository. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-630) provide indication that pig script only partially succeeded
[ https://issues.apache.org/jira/browse/PIG-630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding resolved PIG-630. -- Assignee: Olga Natkovich Fix Version/s: 0.8.0 Resolution: Fixed This jira has been fixed with MultiQuery optimization and Pig Stats. > provide indication that pig script only partially succeeded > --- > > Key: PIG-630 > URL: https://issues.apache.org/jira/browse/PIG-630 > Project: Pig > Issue Type: Bug >Reporter: Olga Natkovich >Assignee: Olga Natkovich > Fix For: 0.8.0 > > > Currently, if you have multiple queries (stores/dumps) within the same pig > script, the script returns the result of the last one, which does not provide > sufficient information to the users. We need to provide the user with the > following information: > - a return code that indicates the script only partially succeeded > - an indication of which parts have succeeded -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1589) add test cases for mapreduce operator which use distributed cache
[ https://issues.apache.org/jira/browse/PIG-1589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909061#action_12909061 ] Richard Ding commented on PIG-1589: --- +1 > add test cases for mapreduce operator which use distributed cache > - > > Key: PIG-1589 > URL: https://issues.apache.org/jira/browse/PIG-1589 > Project: Pig > Issue Type: Task >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1589.1.patch, TestWordCount.jar > > > '-files filename' can be specified in the parameters for mapreduce operator > to send files to distributed cache. Need to add test cases for that. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1609) 'union onschema' should give a more useful error message when schema of one of the relations has null column name
[ https://issues.apache.org/jira/browse/PIG-1609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909412#action_12909412 ] Richard Ding commented on PIG-1609: --- +1 > 'union onschema' should give a more useful error message when schema of one > of the relations has null column name > - > > Key: PIG-1609 > URL: https://issues.apache.org/jira/browse/PIG-1609 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1609.1.patch > > > A better error message needs to be given in this case - > {code} > grunt> l = load '/tmp/empty.bag' as (i : int); > grunt> f = foreach l generate i+1; > grunt> describe f; > f: {int} > grunt> u = union onschema l , f; > 2010-09-10 18:08:13,000 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1000: Error during parsing. Error merging > schemas for union operator > Details at logfile: /Users/tejas/pig_nmr_syn/trunk/pig_1284167020897.log > {code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: PIG-1479_2.patch In the previous patch, the executeScript method on ScriptPigServer returns a list of ExecJobs (one for each store statement in the script). Unfortunately, the order of ExecJobs in the list is indeterminate. This patch fixes this problem by making the executeScript method return a PigStats object. One can then retrieve the output result by the alias corresponding to the store statement. Here is an example: {code} P = pig.executeScript(""" A = load '${input}'; ... ... store G into '${output}'; """) output = P.result("G") # an OutputStats object iter = output.iterator() if iter.hasNext(): # do something else: # do something else {code} > Embed Pig in scripting languages > > > Key: PIG-1479 > URL: https://issues.apache.org/jira/browse/PIG-1479 > Project: Pig > Issue Type: New Feature >Reporter: Julien Le Dem > Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek.tgz > > > It should be possible to embed Pig calls in a scripting language and let > functions defined in the same script available as UDFs. > This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which > lets users define UDFs in scripting languages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: pig-greek-test.tar Attach the updated test program from Julien. To run the example: * tar -xvf pig-greek-test.tar * java -cp pig.jar: org.apache.pig.Main -x local -g script/tc.py > Embed Pig in scripting languages > > > Key: PIG-1479 > URL: https://issues.apache.org/jira/browse/PIG-1479 > Project: Pig > Issue Type: New Feature >Reporter: Julien Le Dem > Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, > pig-greek.tgz > > > It should be possible to embed Pig calls in a scripting language and let > functions defined in the same script available as UDFs. > This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which > lets users define UDFs in scripting languages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1607) pig should have separate javadoc.jar in the maven repository
[ https://issues.apache.org/jira/browse/PIG-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909814#action_12909814 ] Richard Ding commented on PIG-1607: --- The test result can be viewed here: {code} https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/pig/0.8.0-SNAPSHOT/ {code} > pig should have separate javadoc.jar in the maven repository > > > Key: PIG-1607 > URL: https://issues.apache.org/jira/browse/PIG-1607 > Project: Pig > Issue Type: Bug >Reporter: niraj rai >Assignee: niraj rai > Attachments: PIG-1607_0.patch, PIG-1607_1.patch, PIG-1607_2.patch > > > At this moment, javadoc is part of the source.jar but pig should have > separate javadoc.jar in the maven repository. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1615) Return code from Pig is 0 even if the job fails when using -M flag
[ https://issues.apache.org/jira/browse/PIG-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910407#action_12910407 ] Richard Ding commented on PIG-1615: --- This problem exists in Pig 0.7 and is fixed in Pig 0.8. > Return code from Pig is 0 even if the job fails when using -M flag > -- > > Key: PIG-1615 > URL: https://issues.apache.org/jira/browse/PIG-1615 > Project: Pig > Issue Type: Bug >Affects Versions: 0.6.0, 0.7.0 >Reporter: Viraj Bhat > Fix For: 0.8.0 > > > I have a Pig script of this form, which I used inside a workflow system such > as Oozie. > {code} > A = load '$INPUT' using PigStorage(); > store A into '$OUTPUT'; > {code} > I run this with multi-query optimization turned off: > {quote} > $java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p > INPUT=/user/viraj/junk1 -M -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig > {quote} > The directory "/user/viraj/junk1" is not present. > I get the following results: > {quote} > Input(s): > Failed to read data from "/user/viraj/junk1" > Output(s): > Failed to produce result in "/user/viraj/junk2" > {quote} > This is expected, but the return code is still 0 > {code} > $ echo $? > 0 > {code} > If I run this script with multi-query optimization turned on, it gives a > return code of 2, which is correct. > {code} > $ java -cp ~/pig-svn/trunk/pig.jar:$HADOOP_CONF_DIR org.apache.pig.Main -p > INPUT=/user/viraj/junk1 -p OUTPUT=/user/viraj/junk2 loadpigstorage.pig > ... > $ echo $? > 2 > {code} > I believe a wrong return code from Pig is causing Oozie to believe that the Pig > script succeeded. > Viraj -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
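The report above hinges on how a workflow system interprets the exit status of the process it launches. The Python sketch below shows that check in isolation; it does not invoke Pig or Oozie. A child Python process that exits with a chosen status stands in for the Pig run, and `run_step` is a hypothetical helper name.

```python
import subprocess
import sys


def run_step(cmd):
    """Run a command and decide success strictly from its exit status."""
    result = subprocess.run(cmd)
    return result.returncode == 0, result.returncode


# Stand-ins for a Pig invocation: one exits 0, one exits 2 (the code reported for a failed run).
good_ok, good_code = run_step([sys.executable, "-c", "import sys; sys.exit(0)"])
bad_ok, bad_code = run_step([sys.executable, "-c", "import sys; sys.exit(2)"])
```

A caller that relies on this kind of check treats a failed job that still exits with 0 as a success, which is the behaviour the reporter attributes to the -M case.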
[jira] Commented: (PIG-1610) 'union onschema' does handle some cases involving 'namespaced' column names in schema
[ https://issues.apache.org/jira/browse/PIG-1610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910409#action_12910409 ] Richard Ding commented on PIG-1610: --- +1 > 'union onschema' does handle some cases involving 'namespaced' column names > in schema > - > > Key: PIG-1610 > URL: https://issues.apache.org/jira/browse/PIG-1610 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1610.1.patch, PIG-1610.2.patch > > > case 1: > grunt> describe f; > f: {l1::a: bytearray,l1::b: bytearray} > grunt> describe l1; > l1: {a: bytearray,b: bytearray} > grunt> dump f; > (1,11) > (2,22) > (3,33) > grunt> dump l1; > (1,11) > (2,22) > (3,33) > grunt> u = union onschema f, l1; > grunt> describe u; > u: {l1::a: bytearray,l1::b: bytearray} > -- the dump u gives incorrect results > grunt> dump u; > (,) > (,) > (,) > (1,11) > (2,22) > (3,33) > case 2: > grunt> u = union onschema l1, f; > grunt> describe u; > 2010-09-13 15:11:13,877 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR > 1108: Duplicate schema alias: l1::a > Details at logfile: /Users/tejas/pig_unions_err2/trunk/pig_1284410413970.log -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1479) Embed Pig in scripting languages
[ https://issues.apache.org/jira/browse/PIG-1479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Ding updated PIG-1479: -- Attachment: pig-greek-test.tar Attach the test script modified based on Julien's comment. As for the command-line option -g, it can also take one parameter (the script file name) and let Pig determine the script engine by the file extension. > Embed Pig in scripting languages > > > Key: PIG-1479 > URL: https://issues.apache.org/jira/browse/PIG-1479 > Project: Pig > Issue Type: New Feature >Reporter: Julien Le Dem > Attachments: PIG-1479.patch, PIG-1479_2.patch, pig-greek-test.tar, > pig-greek-test.tar, pig-greek.tgz > > > It should be possible to embed Pig calls in a scripting language and let > functions defined in the same script available as UDFs. > This is a spin off of https://issues.apache.org/jira/browse/PIG-928 which > lets users define UDFs in scripting languages. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
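Choosing a script engine from the file extension, as suggested in the comment above, could look roughly like the following. This is only a sketch of the idea: the mapping, the engine name "jython" and the function `pick_engine` are assumptions and do not describe what the patch actually does.

```python
import os

# Hypothetical registry; the thread only shows a .py script being run with -g.
ENGINES_BY_EXTENSION = {".py": "jython"}


def pick_engine(script_path):
    """Return the engine registered for the script's extension, or raise ValueError."""
    ext = os.path.splitext(script_path)[1].lower()
    if ext not in ENGINES_BY_EXTENSION:
        raise ValueError(f"no script engine registered for extension {ext!r}")
    return ENGINES_BY_EXTENSION[ext]


engine = pick_engine("script/tc.py")
```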
[jira] Commented: (PIG-1616) 'union onschema' does not use create output with correct schema when udfs are involved
[ https://issues.apache.org/jira/browse/PIG-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912696#action_12912696 ] Richard Ding commented on PIG-1616: --- +1 > 'union onschema' does not use create output with correct schema when udfs are > involved > -- > > Key: PIG-1616 > URL: https://issues.apache.org/jira/browse/PIG-1616 > Project: Pig > Issue Type: Bug >Affects Versions: 0.8.0 >Reporter: Thejas M Nair >Assignee: Thejas M Nair > Fix For: 0.8.0 > > Attachments: PIG-1616.1.patch > > > 'union onschema' creates a merged schema based on the input schemas. It does > that in the queryparser, and at that stage the udf return type used is the > default return type. The actual return type for the udf is determined later > in the TypeCheckingVisitor using EvalFunc.getArgsToFuncMapping(). > 'union onschema' should use the final type for its input relation to create > the merged schema. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913736#action_12913736 ]

Richard Ding commented on PIG-1641:
---

Hadoop counters are not available in local mode (PIG-1286). So for now I propose that, in local mode, Pig stats output is changed to something like the following:

{code}
Job Stats (time in seconds):
JobId           Alias      Feature   Outputs
job_local_0001  raw        MAP_ONLY
job_local_0002  rank_sort  SAMPLER
job_local_0003  rank_sort  ORDER_BY  Processed/user_visits_table,

Input(s):
Successfully read records from: "Data/Raw/UserVisits.dat"

Output(s):
Successfully stored records in: "Processed/user_visits_table"
{code}

> Incorrect counters in local mode
>
>
> Key: PIG-1641
> URL: https://issues.apache.org/jira/browse/PIG-1641
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Ashutosh Chauhan
>
> User report, not verified.
>
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
> Success!
> Job Stats (time in seconds):
> JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
> job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
> job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
> job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
> Input(s):
> Successfully read 0 records from: "Data/Raw/UserVisits.dat"
> Output(s):
> Successfully stored 0 records in: "Processed/user_visits_table"
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
>
> Is it that in local mode counters are not available? If so, instead of
> printing zeros we should print "Information Unavailable" or some such.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
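Note: the proposal above amounts to leaving out counter-derived fields when counters are unavailable, rather than printing zeros. A minimal Python sketch of that idea follows. It is not Pig's actual stats code: the functions `job_row` and `summary_line` and their parameters are hypothetical, and `None` is assumed to stand for "counters unavailable".

```python
def job_row(job_id, alias, feature, outputs="", counters=None):
    """Format one tab-separated job stats row.

    `counters` is a sequence of per-job numbers (maps, reduces, timings),
    or None when counters are unavailable, e.g. in local mode. In that
    case the counter columns are omitted instead of printed as zeros.
    """
    columns = [job_id]
    if counters is not None:
        columns.extend(str(value) for value in counters)
    columns.extend([alias, feature, outputs])
    return "\t".join(columns)


def summary_line(verb, preposition, path, count=None):
    """Format an input/output summary line, omitting an unknown count."""
    if count is None:
        return 'Successfully %s records %s: "%s"' % (verb, preposition, path)
    return 'Successfully %s %d records %s: "%s"' % (
        verb, count, preposition, path)


if __name__ == "__main__":
    print(job_row("job_local_0001", "raw", "MAP_ONLY"))
    print(summary_line("read", "from", "Data/Raw/UserVisits.dat"))
    print(summary_line("stored", "in", "Processed/user_visits_table"))
```

Keeping "unknown" distinct from zero is the point of the design: a reported count of 0 next to a 7.3G output file is misleading, while an omitted count is merely incomplete.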
[jira] Assigned: (PIG-1641) Incorrect counters in local mode
[ https://issues.apache.org/jira/browse/PIG-1641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding reassigned PIG-1641:
-
Assignee: Richard Ding

> Incorrect counters in local mode
>
>
> Key: PIG-1641
> URL: https://issues.apache.org/jira/browse/PIG-1641
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Ashutosh Chauhan
> Assignee: Richard Ding
>
> User report, not verified.
>
> HadoopVersion  PigVersion      UserId  StartedAt            FinishedAt           Features
> 0.20.2         0.8.0-SNAPSHOT  user    2010-09-21 19:25:58  2010-09-21 21:58:42  ORDER_BY
> Success!
> Job Stats (time in seconds):
> JobId           Maps  Reduces  MaxMapTime  MinMapTIme  AvgMapTime  MaxReduceTime  MinReduceTime  AvgReduceTime  Alias      Feature   Outputs
> job_local_0001  0     0        0           0           0           0              0              0              raw        MAP_ONLY
> job_local_0002  0     0        0           0           0           0              0              0              rank_sort  SAMPLER
> job_local_0003  0     0        0           0           0           0              0              0              rank_sort  ORDER_BY  Processed/user_visits_table,
> Input(s):
> Successfully read 0 records from: "Data/Raw/UserVisits.dat"
> Output(s):
> Successfully stored 0 records in: "Processed/user_visits_table"
> However, when I look in the output:
> $ ls -lh Processed/user_visits_table/CG0/
> total 15250760
> -rwxrwxrwx 1 user _lpoperator 7.3G Sep 21 21:58 part-0*
> It read a 20G input file and generated some output...
>
> Is it that in local mode counters are not available? If so, instead of
> printing zeros we should print "Information Unavailable" or some such.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1642) Order by doesn't use estimation to determine the parallelism
[ https://issues.apache.org/jira/browse/PIG-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1642:
--
Summary: Order by doesn't use estimation to determine the parallelism (was: Order by doesn't use estimation to determine the paralelism)

> Order by doesn't use estimation to determine the parallelism
>
>
> Key: PIG-1642
> URL: https://issues.apache.org/jira/browse/PIG-1642
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.8.0
> Reporter: Richard Ding
> Fix For: 0.8.0
>
>
> With PIG-1249, a simple heuristic is used to determine the number of reducers
> if it isn't specified (via PARALLEL or default_parallel). For an order by
> statement, however, it still defaults to 1.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
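Note: the issue above does not spell out the PIG-1249 heuristic. One plausible form of a size-based estimate is sketched below in Python: divide the total input size by a per-reducer byte budget, then clamp the result between 1 and a maximum. The function names and the default values (`bytes_per_reducer`, `max_reducers`) are assumptions chosen for illustration, not values taken from the issue.

```python
import math


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1000 ** 3,  # assumed budget (~1 GB)
                      max_reducers=999):            # assumed upper bound
    """Estimate a reducer count from the total input size."""
    if bytes_per_reducer <= 0:
        raise ValueError("bytes_per_reducer must be positive")
    estimate = int(math.ceil(total_input_bytes / float(bytes_per_reducer)))
    # Always use at least one reducer, and never more than the cap.
    return max(1, min(max_reducers, estimate))


def choose_parallelism(requested, total_input_bytes):
    """Prefer an explicit setting; otherwise fall back to the estimate.

    `requested` stands for a value given via PARALLEL or default_parallel,
    or None when neither is set.
    """
    if requested is not None and requested > 0:
        return requested
    return estimate_reducers(total_input_bytes)
```

Under these assumptions, fixing the reported bug would mean an order by with no explicit parallelism goes through the same fallback as other operators instead of defaulting to a single reducer.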