[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Description: We serialize POStore too early in the JobControlCompiler. At 
that time, storeFunc has unconstrained links to other operators; in the worst 
case, it will chain in the whole physical plan. Also, in the multi-store case, 
POStore has a link to its data source, which is not needed and increases the 
footprint of the serialized POStore.   (was: Currently, if a pig script 
contains multiple map-reduce jobs, each job will serialize the whole map-reduce 
plan into JobConf. The reason is that PhysicalOperator.inputs is built in the 
physical plan and in fact chains all the jobs without regard to map-reduce 
boundaries. Further, when we only want to serialize POStore, we serialize this 
whole plan again due to POStore.inputs. This should be fixed.)
Summary: Optimize POStore serialized into JobConf  (was: Optimize 
content of mapPlan/reducePlan to be serialized into JobConf)

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch


 We serialize POStore too early in the JobControlCompiler. At that time, 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain in the whole physical plan. Also, in the multi-store case, POStore 
 has a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 
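The effect described above can be illustrated with a self-contained sketch (plain Java; the `Op` class, field sizes, and method names are hypothetical stand-ins, not Pig's actual operators): clearing an operator's input links before Java serialization keeps the chained plan out of the payload.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a physical operator chain; not Pig's actual
// classes. Demonstrates why clearing an operator's input links before
// serialization shrinks the serialized payload.
public class StoreSerialization {
    static class Op implements Serializable {
        final String name;
        final List<Op> inputs = new ArrayList<>(); // links that chain the plan
        final byte[] state = new byte[1024];       // per-operator state
        Op(String name) { this.name = name; }
    }

    static int serializedSize(Op op) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(op);
            }
            return bos.size();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static Op chainedStore() {
        Op load = new Op("load");
        Op filter = new Op("filter");
        filter.inputs.add(load);
        Op store = new Op("store");
        store.inputs.add(filter);
        return store;
    }

    public static int sizeWithInputs() {
        // Serializing the store with its inputs drags the whole chain along.
        return serializedSize(chainedStore());
    }

    public static int sizeDisconnected() {
        Op store = chainedStore();
        store.inputs.clear(); // the optimization: unlink before serializing
        return serializedSize(store);
    }

    public static void main(String[] args) {
        System.out.println("with inputs: " + sizeWithInputs()
                + " bytes, disconnected: " + sizeDisconnected() + " bytes");
    }
}
```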

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Open  (was: Patch Available)

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch


 We serialize POStore too early in the JobControlCompiler. At that time, 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain in the whole physical plan. Also, in the multi-store case, POStore 
 has a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Attachment: PIG-1336-3.patch

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch


 We serialize POStore too early in the JobControlCompiler. At that time, 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain in the whole physical plan. Also, in the multi-store case, POStore 
 has a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Patch Available  (was: Open)

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch


 We serialize POStore too early in the JobControlCompiler. At that time, 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain in the whole physical plan. Also, in the multi-store case, POStore 
 has a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Commented: (PIG-200) Pig Performance Benchmarks

2010-03-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851293#action_12851293
 ] 

Daniel Dai commented on PIG-200:


Hi, Duncan,
I tried it and didn't see errors. Are you using the pig 0.6 release? What error 
message did you see?

 Pig Performance Benchmarks
 --

 Key: PIG-200
 URL: https://issues.apache.org/jira/browse/PIG-200
 Project: Pig
  Issue Type: Task
Reporter: Amir Youssefi
Assignee: Alan Gates
 Fix For: 0.2.0

 Attachments: generate_data.pl, perf-0.6.patch, perf.hadoop.patch, 
 perf.patch


 To benchmark Pig performance, we need a TPC-H-like large data set plus a 
 script collection. This is used to compare different Pig releases, and 
 Pig vs. other systems (e.g. Pig + Hadoop vs. Hadoop only).
 Here is the Wiki for small tests: http://wiki.apache.org/pig/PigPerformance
 I am currently running long-running Pig scripts over data sets on the order 
 of tens of TBs. The next step is hundreds of TBs.
 We need an open large data set (open-source scripts which generate the 
 data set) and detailed scripts for important operations such as ORDER, 
 AGGREGATION, etc.
 We can call these the Pig Workouts: Cardio (short processing), Marathon 
 (long-running scripts) and Triathlon (mix). 
 I will update this JIRA with more details of current activities soon.




[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851323#action_12851323
 ] 

Hadoop QA commented on PIG-1309:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440159/pig-1309_1.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

-1 javac.  The applied patch generated 88 javac compiler warnings (more 
than the trunk's current 87 warnings).

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/258/console

This message is automatically generated.

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: mapsideCogrp.patch, pig-1309_1.patch


 In our never-ending quest to make Pig go faster, we want to parallelize as 
 many relational operations as possible. It is already possible to do Group-by 
 (PIG-984) and Joins (PIG-845, PIG-554) purely map-side in Pig. This jira 
 is to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851338#action_12851338
 ] 

Hadoop QA commented on PIG-1338:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440177/PIG-1338-1.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 79 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/270/console

This message is automatically generated.

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change it to the more intuitive:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message
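The proposed rules amount to a small decision function; a plain-Java sketch of the intended behavior (names and return values are illustrative, not Pig's actual API):

```java
// Hypothetical helper capturing the proposed conf-lookup rules for PIG-1338;
// the method and action names are illustrative, not Pig's actual API.
public class ConfLookup {
    public static String decide(String execType, boolean hadoopConfFound) {
        if (execType.equals("local")) {
            // local mode: ignore any hadoop conf, always run locally
            return "launch-local";
        }
        // hadoop mode: require a conf, otherwise fail fast with a message
        return hadoopConfFound ? "launch-hadoop" : "error-missing-conf";
    }

    public static void main(String[] args) {
        System.out.println(decide("local", true));      // launch-local
        System.out.println(decide("mapreduce", false)); // error-missing-conf
    }
}
```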




[jira] Commented: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851404#action_12851404
 ] 

Hadoop QA commented on PIG-1336:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440184/PIG-1336-3.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/259/console

This message is automatically generated.

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch


 We serialize POStore too early in the JobControlCompiler. At that time, 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain in the whole physical plan. Also, in the multi-store case, POStore 
 has a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: jira-1229-v2.patch

Here is the updated patch that compiles against the pig 0.7 branch and 
implements the new load/store APIs. 

Note: I haven't used hadoop's DBOutputFormat, as that code has not yet moved 
to o.p.h.mapreduce.lib and hence there are compatibility issues.
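For illustration, the store side boils down to preparing one parameterized INSERT and executing it per tuple; a minimal sketch of building that statement (a hypothetical helper, not the patch's actual code):

```java
// Hypothetical sketch of the statement a JDBC-backed StoreFunc would
// prepare once and then execute per tuple; not the patch's actual code.
public class InsertSqlBuilder {
    public static String buildInsertSql(String table, int columns) {
        StringBuilder sb = new StringBuilder("INSERT INTO ")
                .append(table).append(" VALUES(");
        for (int i = 0; i < columns; i++) {
            if (i > 0) sb.append(',');
            sb.append('?');  // one placeholder per tuple field
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        // With a real driver, this string would feed
        // connection.prepareStatement(...) and addBatch() per tuple.
        System.out.println(buildInsertSql("mytable", 3));
    }
}
```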

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch, jira-1229.patch


 UDF to store data into a DB




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: (was: hsqldb.jar)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch, jira-1229.patch


 UDF to store data into a DB




[jira] Commented: (PIG-1295) Binary comparator for secondary sort

2010-03-30 Thread Gianmarco De Francisci Morales (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851453#action_12851453
 ] 

Gianmarco De Francisci Morales commented on PIG-1295:
-

Hi, 
I have been reading the source code and the referenced PIG-1038 issue. 

Probably Avro integration is too big a project for GSoC, but implementing 
the tuple binary comparator seems doable. 
I will write a proposal; any advice for it? 

My idea of the project's breakdown would be: 

1. Identify the cases that can be optimized and the appropriate visitor for them. 
2. Write a unit test for this optimization. 
3. Implement the comparator knowing the data types of the tuple. 
4. Write a second unit test with different types. 
5. Write the logic to extract the tuple boundary from schema information (I 
suppose this optimization is possible only if the schema is known). 
6. Try to extend it to the general case of a complex data type as the 
secondary key. 

Thoughts?

 Binary comparator for secondary sort
 

 Key: PIG-1295
 URL: https://issues.apache.org/jira/browse/PIG-1295
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai

 When the hadoop framework does the sorting, it will try to use the binary 
 version of the comparator if available. The benefit of a binary comparator is 
 that we do not need to instantiate the objects before we compare. We saw a 
 ~30% speedup after switching to binary comparators. Currently, Pig uses a 
 binary comparator in the following cases:
 1. When the semantics of the order don't matter. For example, in distinct, we 
 need to sort in order to filter out duplicate values; however, we do not care 
 how the comparator orders keys. Group-by also shares this characteristic. In 
 this case, we rely on hadoop's default binary comparator.
 2. When the semantics of the order matter, but the key is of a simple type. In 
 this case, we have implementations for simple types, such as integer, long, 
 float, chararray, databytearray, string.
 However, if the key is a tuple and the sort semantics matter, we do not have 
 a binary comparator implementation. This especially matters when we switch to 
 secondary sort. In secondary sort, we convert the inner sort of a nested 
 foreach into the secondary key and rely on hadoop to sort on both the main 
 key and the secondary key. The sorting key becomes a two-item tuple. Since 
 the secondary key is the sorting key of the nested foreach, the sorting 
 semantics matter. It turns out we have no binary comparator once we use 
 secondary sort, and we see a significant slowdown.
 A binary comparator for tuples should be doable once we understand the binary 
 structure of the serialized tuple. We can focus on the most common use case 
 first, which is a group-by followed by a nested sort. In this case, we will 
 use secondary sort. The semantics of the first key do not matter but the 
 semantics of the secondary key do. We need to identify the boundary between 
 the main key and the secondary key in the binary tuple buffer without 
 instantiating the tuple itself. Then, if the first keys are equal, we use a 
 binary comparator to compare the secondary keys. The secondary key can also 
 be a complex data type, but for the first step we focus on simple secondary 
 keys, which are the most common use case.
 We mark this issue as a candidate project for the Google Summer of Code 2010 
 program. 
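The core idea, comparing serialized keys without deserializing them, can be sketched in plain Java for big-endian-encoded non-negative int keys (illustrative only; Pig's tuple wire format is richer and this is not its actual comparator):

```java
import java.nio.ByteBuffer;

// Illustrative raw comparison of serialized int keys: because big-endian
// encoding preserves numeric order for non-negative ints, the bytes can be
// compared lexicographically without materializing the values.
public class RawIntComparator {
    public static byte[] serialize(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Compare two serialized non-negative ints byte by byte,
    // treating each byte as unsigned.
    public static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compareRaw(serialize(3), serialize(40)));  // -1
    }
}
```

Hadoop's `RawComparator` interface follows the same pattern, operating directly on the serialized byte buffers the framework hands it.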




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Attachment: (was: jira-1229.patch)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Status: In Progress  (was: Patch Available)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB




[jira] Updated: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Ankur (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur updated PIG-1229:
---

Status: Patch Available  (was: In Progress)

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851455#action_12851455
 ] 

Olga Natkovich commented on PIG-1229:
-

Since we already branched, this feature will not go into the 0.7.0 branch but 
will instead be committed to trunk and released as part of the 0.8.0 release. I 
think this patch should work just fine against trunk since we have not deviated 
much.

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB




[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Patch Available  (was: Open)

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change it to the more intuitive:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Attachment: PIG-1338-2.patch

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change it to the more intuitive:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Open  (was: Patch Available)

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change it to the more intuitive:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Commented: (PIG-1337) Need a way to pass distributed cache configuration information to hadoop backend in Pig's LoadFunc

2010-03-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851479#action_12851479
 ] 

Pradeep Kamath commented on PIG-1337:
-

My worry about doing these kinds of job-related updates in the Job in 
getSchema() is that getSchema is currently designed to be a pure getter without 
any indirect set side effects - this is noted in the javadoc:

{noformat}
/**
 * Get a schema for the data to be loaded.
 * @param location Location as returned by
 * {@link LoadFunc#relativeToAbsolutePath(String, org.apache.hadoop.fs.Path)}
 * @param job The {@link Job} object - this should be used only to obtain
 * cluster properties through {@link Job#getConfiguration()} and not to
 * set/query any runtime job information.
...
{noformat}

We should be careful in opening this up to allow set capability - something to 
consider before designing a fix for this issue.

 Need a way to pass distributed cache configuration information to hadoop 
 backend in Pig's LoadFunc
 --

 Key: PIG-1337
 URL: https://issues.apache.org/jira/browse/PIG-1337
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0
Reporter: Chao Wang
 Fix For: 0.8.0


 The Zebra storage layer needs to use the distributed cache to reduce name 
 node load during job runs.
 To do this, Zebra needs to set up distributed-cache-related configuration 
 information in TableLoader (which extends Pig's LoadFunc).
 It is doing this within getSchema(conf). The problem is that the conf object 
 here is not the one that is serialized to the map/reduce backend. As such, 
 the distributed cache is not set up properly.
 To work around this problem, Pig's LoadFunc needs to provide a conf object 
 that we can use to set up distributed cache information, where that conf 
 object is the one used by the map/reduce backend.
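The mismatch can be modeled in plain Java, with `Properties` standing in for Hadoop's `Configuration` (the class and property names here are illustrative): settings applied to a detached copy never reach the conf that ships to the backend.

```java
import java.util.Properties;

// Plain-Java model of the PIG-1337 problem: Properties stands in for
// Hadoop's Configuration. Settings applied to a detached copy are lost;
// only settings on the job's own conf reach the backend.
public class ConfPropagation {
    public static boolean setOnCopyReachesJob() {
        Properties jobConf = new Properties();
        Properties copy = new Properties();
        copy.putAll(jobConf);  // what getSchema() effectively received
        copy.setProperty("mapred.cache.files", "/cache/zebra.meta");
        return jobConf.containsKey("mapred.cache.files");  // false: lost
    }

    public static boolean setOnJobConfReachesJob() {
        Properties jobConf = new Properties();
        jobConf.setProperty("mapred.cache.files", "/cache/zebra.meta");
        return jobConf.containsKey("mapred.cache.files");  // true
    }

    public static void main(String[] args) {
        System.out.println(setOnCopyReachesJob() + " " + setOnJobConfReachesJob());
    }
}
```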




[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851489#action_12851489
 ] 

Pradeep Kamath commented on PIG-1338:
-

I haven't done a full review but had a comment on one of the changes which is 
pretty important:
{noformat}
Index: src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
===================================================================
--- src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
(revision 928370)
+++ src/org/apache/pig/backend/hadoop/datastorage/ConfigurationUtil.java
(working copy)
@@ -30,7 +30,9 @@
 
     public static Configuration toConfiguration(Properties properties) {
         assert properties != null;
-        final Configuration config = new Configuration();
+        final Configuration config = new Configuration(false);
+        config.addResource("core-default.xml");
+        config.addResource("mapred-default.xml");
         final Enumeration<Object> iter = properties.keys();
         while (iter.hasMoreElements()) {
             final String key = (String) iter.nextElement();
{noformat}

Looking at the Configuration class's implementation I found the following code:

{noformat}

  static {
    // print deprecation warning if hadoop-site.xml is found in classpath
    ClassLoader cL = Thread.currentThread().getContextClassLoader();
    if (cL == null) {
      cL = Configuration.class.getClassLoader();
    }
    if (cL.getResource("hadoop-site.xml") != null) {
      LOG.warn("DEPRECATED: hadoop-site.xml found in the classpath. " +
          "Usage of hadoop-site.xml is deprecated. Instead use core-site.xml, " +
          "mapred-site.xml and hdfs-site.xml to override properties of " +
          "core-default.xml, mapred-default.xml and hdfs-default.xml " +
          "respectively");
    }
    addDefaultResource("core-default.xml");
    addDefaultResource("core-site.xml");
  }

  private void loadResources(Properties properties,
                             ArrayList resources,
                             boolean quiet) {
    if (loadDefaults) {
      for (String resource : defaultResources) {
        loadResource(properties, resource, quiet);
      }

      // support hadoop-site.xml as a deprecated case
      if (getResource("hadoop-site.xml") != null) {
        loadResource(properties, "hadoop-site.xml", quiet);
      }
    }

    for (Object resource : resources) {
      loadResource(properties, resource, quiet);
    }
  }

{noformat}

There are two questions about the code in Configuration vs. the change in 
this patch:
1) In the patch, core-default.xml and mapred-default.xml are added as 
resources, while in Configuration core-default.xml and core-site.xml are added 
by default.
2) In the patch, hadoop-site.xml is not considered, while in Configuration it 
is - so if a hadoop 20.x cluster is installed with hadoop-site.xml configured 
and without the other .xml files (like core-default.xml etc.), then pig would 
not get the cluster config information, right?
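The second question can be made concrete with a plain-Java model of the two loading strategies (a toy loader over named resource sets, not Hadoop's real `Configuration`): a cluster that only ships hadoop-site.xml is visible to the default loader but invisible to the patch's explicit list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of Hadoop conf resource loading; not the real Configuration.
// "present" is the set of config files actually installed on the cluster.
public class ResourceLoading {
    // Default Configuration behavior: standard defaults plus the
    // deprecated hadoop-site.xml fallback.
    public static List<String> loadDefault(Set<String> present) {
        List<String> loaded = new ArrayList<>();
        for (String r : new String[] {"core-default.xml", "core-site.xml"}) {
            if (present.contains(r)) loaded.add(r);
        }
        if (present.contains("hadoop-site.xml")) loaded.add("hadoop-site.xml");
        return loaded;
    }

    // The patch's behavior: only the explicitly listed resources.
    public static List<String> loadExplicit(Set<String> present) {
        List<String> loaded = new ArrayList<>();
        for (String r : new String[] {"core-default.xml", "mapred-default.xml"}) {
            if (present.contains(r)) loaded.add(r);
        }
        return loaded;
    }

    public static void main(String[] args) {
        // A 20.x cluster configured only via hadoop-site.xml:
        Set<String> present = Set.of("hadoop-site.xml");
        System.out.println("default:  " + loadDefault(present));
        System.out.println("explicit: " + loadExplicit(present));
    }
}
```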

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change it to the more intuitive:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Assigned: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1310:
---

Assignee: Russell Jurney

 ISO Date UDFs: Conversion, Trucation and Date Math
 --

 Key: PIG-1310
 URL: https://issues.apache.org/jira/browse/PIG-1310
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.7.0

 Attachments: joda-mavenstuff.diff, pass.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
 formatted date strings, and working with them as ISO datetimes using jodatime.
 The working code is here: 
 http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
 It needs to be documented and tests added, and a couple UDFs are missing, but 
 these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
 get this stuff in piggybank before someone else writes it this time :)  The 
 rounding also may not be performant, but the code works.
 Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
 slap me if this isn't done soon, it is not much work and this should help 
 everyone working with time series.




[jira] Commented: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851518#action_12851518
 ] 

Richard Ding commented on PIG-1336:
---

In the multi-store case, the parent plan can be set in POStore and should also 
be unlinked.

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch


 We serialize POStore too early in the JobControlCompiler. At that time, the 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 will chain the whole physical plan. Also, in the multi-store case, POStore has 
 a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Commented: (PIG-1335) UDFFinder should find LoadFunc used by POCast

2010-03-30 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851519#action_12851519
 ] 

Richard Ding commented on PIG-1335:
---

+1

 UDFFinder should find LoadFunc used by POCast
 -

 Key: PIG-1335
 URL: https://issues.apache.org/jira/browse/PIG-1335
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1335-1.patch


 UDFFinder doesn't look into POCast, so it will miss the LoadFunc used by POCast 
 for lineage. We could see a ClassNotFoundException in some cases. Here is 
 a sample script:
 {code}
 a = load '1.txt' using CustomLoader() as (a0, a1, a2);
 b = group a by a0;
 c = foreach b generate flatten(a);
 d = order c by a0;
 e = foreach d generate(a1+a2);  -- use lineage
 dump e;
 {code}




[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851526#action_12851526
 ] 

Alan Gates commented on PIG-1310:
-

New patch looks good.  Piggybank tests pass.  I'm rerunning the patch test to 
check things like javac warnings, etc.  As long as that all returns success 
I'll commit it.  Then I'll apply it to 0.7, test it there, and assuming all is 
well, commit it there too.

 ISO Date UDFs: Conversion, Trucation and Date Math
 --

 Key: PIG-1310
 URL: https://issues.apache.org/jira/browse/PIG-1310
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.7.0

 Attachments: joda-mavenstuff.diff, pass.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
 formatted date strings, and working with them as ISO datetimes using jodatime.
 The working code is here: 
 http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
 It needs to be documented and tests added, and a couple UDFs are missing, but 
 these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
 get this stuff in piggybank before someone else writes it this time :)  The 
 rounding also may not be performant, but the code works.
 Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
 slap me if this isn't done soon, it is not much work and this should help 
 everyone working with time series.




Re: [jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Russell Jurney
Cool - one thing though - Piggybank itself does not build in trunk.  It must
not have built since 0.6, since the load/store func changes went in.  Does
something need to be done there?  Should I submit a patch that removes all
the broken UDFs to make ant build in piggybank work on trunk?

To get piggybank to build, I had to remove:

! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestMultiStorage.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestRegExLoader.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/TestPigStorageSchema.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalString.java

Is this just me, or is this fixed on other branches?




[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Russell Jurney (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851542#action_12851542
 ] 

Russell Jurney commented on PIG-1310:
-

Cool - one thing though - Piggybank itself does not build in trunk.  It must 
not have built since 0.6, since the load/store func changes went in.  Does 
something need to be done there?  Should I submit a patch that removes all the 
broken UDFs to make ant build in piggybank work on trunk?

To get piggybank to build, I had to remove:

! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestMultiStorage.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestSequenceFileLoader.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/TestRegExLoader.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/TestPigStorageSchema.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/string/TestLookupInFiles.java
! contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/evaluation/TestEvalString.java

Is this just me, or is this fixed on other branches?

 ISO Date UDFs: Conversion, Trucation and Date Math
 --

 Key: PIG-1310
 URL: https://issues.apache.org/jira/browse/PIG-1310
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.7.0

 Attachments: joda-mavenstuff.diff, pass.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
 formatted date strings, and working with them as ISO datetimes using jodatime.
 The working code is here: 
 http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
 It needs to be documented and tests added, and a couple UDFs are missing, but 
 these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
 get this stuff in piggybank before someone else writes it this time :)  The 
 rounding also may not be performant, but the code works.
 Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
 slap me if this isn't done soon, it is not much work and this should help 
 everyone working with time series.




[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Status: Open  (was: Patch Available)

 PigServer leaks memory over time
 

 Key: PIG-1313
 URL: https://issues.apache.org/jira/browse/PIG-1313
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Bill Graham
 Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
 PIG-1313-1.patch, PIG-1313-2.patch, Pig1313Reproducer.java


 When {{PigServer}} runs it creates temporary files using the 
 {{FileLocalizer.getTemporaryPath(..)}} method. This static method creates and 
 returns a handle to a temporary file (as an instance of 
 {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
 are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
 are removed by the {{FileLocalizer.deleteTempFile()}} method.
 The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
 called is in the Main class. {{PigServer}} does not call that method, though, 
 so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
 jobs will leak memory via {{toDelete}}.
 One suggested fix is to have {{PigServer.shutdown()}} call 
 {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
 multi-threaded environment, since {{ElementDescriptors}} are pushed 
 onto the {{toDelete}} stack before they're used, not once they're done with. 
 With this approach, running multiple instances of {{PigServer}} in separate 
 threads could cause one completed job to clobber another's still-in-use 
 temp files.
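A minimal sketch of the hazard described above (hypothetical names and paths, not Pig's actual FileLocalizer API): a static collection shared by all instances grows without bound unless some caller remembers to drain it, and draining it from one thread could remove entries another thread is still using.

```java
import java.util.*;

// Illustrative only: a static registry of temp paths that is never
// drained. Entries are registered before use, so a shutdown hook that
// drains the whole stack could clobber files still in use elsewhere.
public class TempFileRegistry {
    private static final Deque<String> toDelete = new ArrayDeque<>();

    static String getTemporaryPath(String prefix) {
        String path = prefix + "/tmp-" + toDelete.size();
        toDelete.push(path); // registered before use, never removed here
        return path;
    }

    static int pending() {
        return toDelete.size();
    }

    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            getTemporaryPath("/user/job" + i);
        }
        System.out.println(pending()); // grows with every call, never shrinks
    }
}
```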




[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Status: Patch Available  (was: Open)

 PigServer leaks memory over time
 

 Key: PIG-1313
 URL: https://issues.apache.org/jira/browse/PIG-1313
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Bill Graham
 Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
 PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java


 When {{PigServer}} runs it creates temporary files using the 
 {{FileLocalizer.getTemporaryPath(..)}} method. This static method creates and 
 returns a handle to a temporary file (as an instance of 
 {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
 are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
 are removed by the {{FileLocalizer.deleteTempFile()}} method.
 The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
 called is in the Main class. {{PigServer}} does not call that method, though, 
 so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
 jobs will leak memory via {{toDelete}}.
 One suggested fix is to have {{PigServer.shutdown()}} call 
 {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
 multi-threaded environment, since {{ElementDescriptors}} are pushed 
 onto the {{toDelete}} stack before they're used, not once they're done with. 
 With this approach, running multiple instances of {{PigServer}} in separate 
 threads could cause one completed job to clobber another's still-in-use 
 temp files.




[jira] Updated: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Bill Graham (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Graham updated PIG-1313:
-

Attachment: PIG-1313-3.patch

 PigServer leaks memory over time
 

 Key: PIG-1313
 URL: https://issues.apache.org/jira/browse/PIG-1313
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Bill Graham
 Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
 PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java


 When {{PigServer}} runs it creates temporary files using the 
 {{FileLocalizer.getTemporaryPath(..)}} method. This static method creates and 
 returns a handle to a temporary file (as an instance of 
 {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
 are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
 are removed by the {{FileLocalizer.deleteTempFile()}} method.
 The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
 called is in the Main class. {{PigServer}} does not call that method, though, 
 so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
 jobs will leak memory via {{toDelete}}.
 One suggested fix is to have {{PigServer.shutdown()}} call 
 {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
 multi-threaded environment, since {{ElementDescriptors}} are pushed 
 onto the {{toDelete}} stack before they're used, not once they're done with. 
 With this approach, running multiple instances of {{PigServer}} in separate 
 threads could cause one completed job to clobber another's still-in-use 
 temp files.




[jira] Commented: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851594#action_12851594
 ] 

Dmitriy V. Ryaboy commented on PIG-1310:


It builds -- you just have to build pig with the test classes first, *then* 
test piggybank.  Those Piggybank tests require some of the test helpers Pig has.

 ISO Date UDFs: Conversion, Trucation and Date Math
 --

 Key: PIG-1310
 URL: https://issues.apache.org/jira/browse/PIG-1310
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.7.0

 Attachments: joda-mavenstuff.diff, pass.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
 formatted date strings, and working with them as ISO datetimes using jodatime.
 The working code is here: 
 http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
 It needs to be documented and tests added, and a couple UDFs are missing, but 
 these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
 get this stuff in piggybank before someone else writes it this time :)  The 
 rounding also may not be performant, but the code works.
 Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
 slap me if this isn't done soon, it is not much work and this should help 
 everyone working with time series.




[jira] Updated: (PIG-1331) Owl Hadoop Table Management Service

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1331:


Attachment: build.log

 Owl Hadoop Table Management Service
 ---

 Key: PIG-1331
 URL: https://issues.apache.org/jira/browse/PIG-1331
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
 Attachments: build.log, owl.contrib.3.tgz


 This JIRA is a proposal to create a Hadoop table management service: Owl. 
 Today, MapReduce and Pig applications interact directly with HDFS 
 directories and files and must deal with low-level data management issues 
 such as storage formats, serialization/compression schemes, data layout, and 
 efficient data access, often with different solutions. Owl aims to 
 provide a standard way to address these issues and abstracts away the 
 complexity of reading/writing huge amounts of data from/to HDFS.
 Owl has a data access API that is modeled after the traditional Hadoop 
 InputFormat and a management API to manipulate Owl objects.  This JIRA is 
 related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata 
 store.  Owl integrates with different storage modules, like Zebra, through a 
 pluggable architecture.
  Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
 time, it makes sense to move it to a Hadoop subproject.




[jira] Commented: (PIG-1331) Owl Hadoop Table Management Service

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851602#action_12851602
 ] 

Alan Gates commented on PIG-1331:
-

The patch as provided doesn't build; it gets an ivy error. I've attached a copy 
of the stdout and stderr from the build run.

 Owl Hadoop Table Management Service
 ---

 Key: PIG-1331
 URL: https://issues.apache.org/jira/browse/PIG-1331
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Jay Tang
 Attachments: build.log, owl.contrib.3.tgz


 This JIRA is a proposal to create a Hadoop table management service: Owl. 
 Today, MapReduce and Pig applications interact directly with HDFS 
 directories and files and must deal with low-level data management issues 
 such as storage formats, serialization/compression schemes, data layout, and 
 efficient data access, often with different solutions. Owl aims to 
 provide a standard way to address these issues and abstracts away the 
 complexity of reading/writing huge amounts of data from/to HDFS.
 Owl has a data access API that is modeled after the traditional Hadoop 
 InputFormat and a management API to manipulate Owl objects.  This JIRA is 
 related to PIG-823 (Hadoop Metadata Service), as Owl has an internal metadata 
 store.  Owl integrates with different storage modules, like Zebra, through a 
 pluggable architecture.
  Initially, the proposal is to submit Owl as a Pig contrib project.  Over 
 time, it makes sense to move it to a Hadoop subproject.




[jira] Created: (PIG-1339) International characters in column names not supported

2010-03-30 Thread Viraj Bhat (JIRA)
International characters in column names not supported
--

 Key: PIG-1339
 URL: https://issues.apache.org/jira/browse/PIG-1339
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


There is a particular use case in which someone specifies a column name using 
international characters.

{code}
inputdata = load '/user/viraj/inputdata.txt' using PigStorage() as (あいうえお);
describe inputdata;
dump inputdata;
{code}
==
Pig Stack Trace
---
ERROR 1000: Error during parsing. Lexical error at line 1, column 64.  
Encountered: \u3042 (12354), after : 

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, 
column 64.  Encountered: \u3042 (12354), after : 

at 
org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1791)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:8959)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_51(QueryParser.java:7462)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_120(QueryParser.java:7769)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_106(QueryParser.java:7787)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_63(QueryParser.java:8609)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_32(QueryParser.java:8621)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:8354)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:6903)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1249)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:911)
at 
org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:700)
at 
org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1164)
at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1114)
at org.apache.pig.PigServer.registerQuery(PigServer.java:425)
at 
org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:737)
at 
org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:89)
at org.apache.pig.Main.main(Main.java:391)
==

Thanks Viraj




[jira] Updated: (PIG-1310) ISO Date UDFs: Conversion, Trucation and Date Math

2010-03-30 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-1310:


Resolution: Fixed
Status: Resolved  (was: Patch Available)

Checked into trunk and 0.7 branch.  Thanks Russell for your tireless work on 
this.

 ISO Date UDFs: Conversion, Trucation and Date Math
 --

 Key: PIG-1310
 URL: https://issues.apache.org/jira/browse/PIG-1310
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.7.0

 Attachments: joda-mavenstuff.diff, pass.patch

   Original Estimate: 168h
  Remaining Estimate: 168h

 I've written UDFs to handle loading unix times, datemonth values and ISO 8601 
 formatted date strings, and working with them as ISO datetimes using jodatime.
 The working code is here: 
 http://github.com/rjurney/oink/tree/master/src/java/oink/udf/isodate/
 It needs to be documented and tests added, and a couple UDFs are missing, but 
 these work if you REGISTER the jodatime jar in your script.  Hopefully I can 
 get this stuff in piggybank before someone else writes it this time :)  The 
 rounding also may not be performant, but the code works.
 Ultimately I'd also like to enable support for ISO 8601 durations.  Someone 
 slap me if this isn't done soon, it is not much work and this should help 
 everyone working with time series.




[jira] Commented: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851651#action_12851651
 ] 

Hadoop QA commented on PIG-1338:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440253/PIG-1338-2.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 79 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/271/console

This message is automatically generated.

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch in local mode
 * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change this to a more intuitive behavior:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use this conf to launch Pig; if 
 not, bail out with a meaningful message




[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851655#action_12851655
 ] 

Alan Gates commented on PIG-1309:
-

I'm not clear on the need for the typedComparator logic in MapReduceOper.  Can 
you explain why that's necessary?

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: mapsideCogrp.patch, pig-1309_1.patch


 In the never-ending quest to make Pig go faster, we want to parallelize as many 
 relational operations as possible. It's already possible to do Group-by ( 
 PIG-984 ) and Joins ( PIG-845 , PIG-554 ) purely map-side in Pig. This jira 
 is to add a map-side implementation of Cogroup in Pig. Details to follow.




[jira] Assigned: (PIG-1340) [zebra] The zebra version number should be changed from 0.7 to 0.8

2010-03-30 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou reassigned PIG-1340:
-

Assignee: Yan Zhou

 [zebra] The zebra version number should be changed from 0.7 to 0.8
 --

 Key: PIG-1340
 URL: https://issues.apache.org/jira/browse/PIG-1340
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
Priority: Trivial






[jira] Created: (PIG-1340) [zebra] The zebra version number should be changed from 0.7 to 0.8

2010-03-30 Thread Yan Zhou (JIRA)
[zebra] The zebra version number should be changed from 0.7 to 0.8
--

 Key: PIG-1340
 URL: https://issues.apache.org/jira/browse/PIG-1340
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Priority: Trivial







[jira] Created: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20

2010-03-30 Thread Viraj Bhat (JIRA)
Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20
-

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.6.0
Reporter: Viraj Bhat


Script reads in BinStorage data and tries to convert a column which is in 
DataByteArray to Chararray. 

{code}
raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
--filter out null columns
A = filter raw by col1#'bcookie' is not null;

B = foreach A generate col1#'bcookie'  as reqcolumn;
describe B;
--B: {regcolumn: bytearray}
X = limit B 5;
dump X;

B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
describe B;
--B: {convertedcol: chararray}
X = limit B 5;
dump X;

{code}

The first dump produces:

(36co9b55onr8s)
(36co9b55onr8s)
(36hilul5oo1q1)
(36hilul5oo1q1)
(36l4cj15ooa8a)

The second dump produces:
()
()
()
()
()

It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
time(s).
Viraj
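For reference, the raw bytes-to-chararray conversion the script expects is itself trivial, as this plain-Java sketch shows (illustrative only; in Pig the cast is routed through the load function's LoadCaster, which is presumably where this conversion fails and the warning is emitted):

```java
import java.nio.charset.StandardCharsets;

// Sketch: converting the raw bytes of a bytearray value to a chararray
// (String) is a straightforward UTF-8 decode. The bug is that the cast
// yields null instead, producing the empty tuples in the second dump.
public class ByteToChar {
    public static void main(String[] args) {
        byte[] raw = "36co9b55onr8s".getBytes(StandardCharsets.UTF_8);
        String asChararray = new String(raw, StandardCharsets.UTF_8);
        System.out.println(asChararray); // 36co9b55onr8s
    }
}
```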




[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Viraj Bhat (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Bhat updated PIG-1341:


Component/s: impl
Summary: Cannot convert DataByeArray to Chararray and results in 
FIELD_DISCARDED_TYPE_CONVERSION_FAILED  (was: Cannot convert DataByeArray to 
Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED 20)

 Cannot convert DataByeArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat

 The script reads BinStorage data and tries to convert a column that is a DataByteArray to a chararray.
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {regcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj




[jira] Commented: (PIG-1309) Map-side Cogroup

2010-03-30 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851661#action_12851661
 ] 

Ashutosh Chauhan commented on PIG-1309:
---

To build the index, we sample every split and get an index entry corresponding to 
each split. After sampling, all the index entries are sorted and the index is 
written to disk. When I first wrote MergeJoin I wasn't able to figure out how 
to use Hadoop sorting to sort the index, so there is a comment in MRCompiler 
to that effect:
{noformat}
// Sorting of index can possibly be achieved by using Hadoop sorting 
// between map and reduce instead of Pig doing sort. If that is so, 
// it will simplify lot of the code below.
{noformat}
Now I have figured it out :) By default, if LocalRearrange produces a key of type 
tuple, Pig supplies a raw binary comparator (PigTupleWritableComparator) to 
Hadoop to compare tuples, which ignores the semantics of the tuple. We need to 
override that behavior so that Pig supplies the correct tuple comparator 
(PigTupleRawComparator). We need to communicate this information from 
MRCompiler to JobControlCompiler, so I am doing that through the 
MapReduceOper object. 

As nice side effects of this:
a) the code in MRCompiler is indeed simplified now;
b) we got rid of the extra index sorting inside the reducer. 
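The raw-comparator idea can be illustrated with a toy sketch (hypothetical key type and encoding, not Pig's PigTupleRawComparator): if keys are serialized with an order-preserving encoding, index entries can be sorted by comparing raw bytes, with no deserialization inside the comparator:

```python
import struct

def serialize_key(key):
    # Order-preserving encoding for an (unsigned int, ascii string) key:
    # big-endian fixed-width int, then the raw string bytes. With this
    # layout, byte order equals semantic order.
    return struct.pack(">I", key[0]) + key[1].encode("ascii")

keys = [(3, "b"), (1, "z"), (2, "a")]
# Plain byte-wise sort of the serialized keys -- what a raw binary
# comparator does between map and reduce.
sorted_raw = sorted(serialize_key(k) for k in keys)
sorted_ints = [struct.unpack(">I", r[:4])[0] for r in sorted_raw]
```

Here the byte-wise order (1, 2, 3) matches the semantic order of the keys; a comparator that ignored the encoding's semantics could not make that guarantee for arbitrary tuples.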

 Map-side Cogroup
 

 Key: PIG-1309
 URL: https://issues.apache.org/jira/browse/PIG-1309
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Ashutosh Chauhan
Assignee: Ashutosh Chauhan
 Attachments: mapsideCogrp.patch, pig-1309_1.patch


 In never ending quest to make Pig go faster, we want to parallelize as many 
 relational operations as possible. Its already possible to do Group-by( 
 PIG-984 ) and Joins( PIG-845 , PIG-554 ) purely in map-side in Pig. This jira 
 is to add map-side implementation of Cogroup in Pig. Details to follow.




[jira] Commented: (PIG-1229) allow pig to write output into a JDBC db

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851665#action_12851665
 ] 

Hadoop QA commented on PIG-1229:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440249/jira-1229-v2.patch
  against trunk revision 928950.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 4 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/260/console

This message is automatically generated.

 allow pig to write output into a JDBC db
 

 Key: PIG-1229
 URL: https://issues.apache.org/jira/browse/PIG-1229
 Project: Pig
  Issue Type: New Feature
  Components: impl
Reporter: Ian Holsman
Assignee: Ankur
Priority: Minor
 Fix For: 0.8.0

 Attachments: jira-1229-v2.patch


 UDF to store data into a DB




[jira] Created: (PIG-1342) [Zebra] Avoid making unnecessary name node calls for writes in Zebra

2010-03-30 Thread Chao Wang (JIRA)
[Zebra] Avoid making unnecessary name node calls for writes in Zebra


 Key: PIG-1342
 URL: https://issues.apache.org/jira/browse/PIG-1342
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.6.0, 0.7.0
Reporter: Chao Wang
Assignee: Chao Wang
 Fix For: 0.8.0


Currently, table and column-group level metadata is extracted from the job 
configuration object and written to HDFS within checkOutputSpec(). Later on, 
back-end writers open these files to access the metadata when doing writes. 
This puts extra load on the name node, since every writer must make name-node 
calls to open the files. 

We propose the following approach: back-end writers extract the metadata 
directly from the job configuration object, rather than making name-node 
calls and reading it from HDFS.
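The proposed flow can be sketched as follows (plain Python, with a dict standing in for Hadoop's JobConf; the conf key name is invented):

```python
import json

def check_output_spec(jobconf, meta):
    # Front end: embed the metadata in the job configuration itself
    # (the current code instead writes it to files on HDFS).
    jobconf["zebra.output.meta"] = json.dumps(meta)

def open_writer(jobconf):
    # Back end: recover the metadata from the task's local copy of the
    # configuration -- no name-node call to open a metadata file.
    return json.loads(jobconf["zebra.output.meta"])

conf = {}  # stand-in for the JobConf shipped to every task
check_output_spec(conf, {"column_groups": ["cg1", "cg2"]})
meta = open_writer(conf)
```

Since Hadoop already distributes the job configuration to every task, the metadata rides along for free and the per-writer name-node calls disappear.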





[jira] Updated: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich updated PIG-1341:


Fix Version/s: 0.7.0

 Cannot convert DataByeArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
 Fix For: 0.7.0


 The script reads BinStorage data and tries to convert a column that is a DataByteArray to a chararray.
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {regcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj




[jira] Assigned: (PIG-1341) Cannot convert DataByeArray to Chararray and results in FIELD_DISCARDED_TYPE_CONVERSION_FAILED

2010-03-30 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1341:
---

Assignee: Richard Ding

 Cannot convert DataByeArray to Chararray and results in 
 FIELD_DISCARDED_TYPE_CONVERSION_FAILED
 --

 Key: PIG-1341
 URL: https://issues.apache.org/jira/browse/PIG-1341
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Richard Ding
 Fix For: 0.7.0


 The script reads BinStorage data and tries to convert a column that is a DataByteArray to a chararray.
 {code}
 raw = load 'sampledata' using BinStorage() as (col1,col2, col3);
 --filter out null columns
 A = filter raw by col1#'bcookie' is not null;
 B = foreach A generate col1#'bcookie'  as reqcolumn;
 describe B;
 --B: {regcolumn: bytearray}
 X = limit B 5;
 dump X;
 B = foreach A generate (chararray)col1#'bcookie'  as convertedcol;
 describe B;
 --B: {convertedcol: chararray}
 X = limit B 5;
 dump X;
 {code}
 The first dump produces:
 (36co9b55onr8s)
 (36co9b55onr8s)
 (36hilul5oo1q1)
 (36hilul5oo1q1)
 (36l4cj15ooa8a)
 The second dump produces:
 ()
 ()
 ()
 ()
 ()
 It also throws an error message: FIELD_DISCARDED_TYPE_CONVERSION_FAILED 5 
 time(s).
 Viraj




[jira] Created: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-03-30 Thread Viraj Bhat (JIRA)
pig_log file missing even though Main tells it is creating one and an M/R job 
fails 


 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat


I ran into a particular case while running with the latest trunk of Pig.

{code}
$java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig

[main] INFO  org.apache.pig.Main - Logging error messages to: 
/homes/viraj/pig_1263420012601.log

$ls -l pig_1263420012601.log
ls: pig_1263420012601.log: No such file or directory
{code}

The job failed and the log file did not exist, so the only way to debug was to 
look into the JobTracker logs.

Here are some possible causes of this behavior:
1) The underlying filer/NFS had issues. In that case, should we not report an 
error on stdout?
2) Some errors from the backend are not being captured.

Viraj





[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Open  (was: Patch Available)

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change this to the more intuitive behavior:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message
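The proposed behavior boils down to a small decision table; a sketch (illustrative only, not Pig's actual launcher code):

```python
def launch_decision(exec_type, hadoop_conf_found):
    # Proposed behavior: local mode never consults the hadoop conf;
    # hadoop mode requires one and fails loudly otherwise.
    if exec_type == "local":
        return "launch local"              # always, conf present or not
    if hadoop_conf_found:
        return "launch with hadoop conf"
    return "bail out with meaningful message"
```

The key change from the current behavior is the first branch: a hadoop conf on the classpath no longer makes local mode bail out, and a missing conf in hadoop mode is a hard error instead of a silent, broken launch.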




[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Attachment: PIG-1338-3.patch

Did some code restructuring and gave it another shot.

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change this to the more intuitive behavior:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Updated: (PIG-1338) Pig should exclude hadoop conf in local mode

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1338:


Status: Patch Available  (was: Open)

 Pig should exclude hadoop conf in local mode
 

 Key: PIG-1338
 URL: https://issues.apache.org/jira/browse/PIG-1338
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1338-1.patch, PIG-1338-2.patch, PIG-1338-3.patch


 Currently, the behavior for hadoop conf lookup is:
 * in local mode, if there is a hadoop conf, bail out; if there is no hadoop 
 conf, launch local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, still launch without warning, but much functionality will go wrong
 We should change this to the more intuitive behavior:
 * in local mode, always launch Pig in local mode
 * in hadoop mode, if there is a hadoop conf, use it to launch Pig; if 
 not, bail out with a meaningful message




[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Patch Available  (was: Open)

Good catch, thanks Richard. Modified the patch to address that.

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
 PIG-1336-4.patch


 We serialize POStore too early in JobControlCompiler. At that time, the 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 chains in the whole physical plan. Also, in the multi-store case, POStore has 
 a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Attachment: PIG-1336-4.patch

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
 PIG-1336-4.patch


 We serialize POStore too early in JobControlCompiler. At that time, the 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 chains in the whole physical plan. Also, in the multi-store case, POStore has 
 a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Updated: (PIG-1336) Optimize POStore serialized into JobConf

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1336:


Status: Open  (was: Patch Available)

 Optimize POStore serialized into JobConf
 

 Key: PIG-1336
 URL: https://issues.apache.org/jira/browse/PIG-1336
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Attachments: PIG-1336-1.patch, PIG-1336-2.patch, PIG-1336-3.patch, 
 PIG-1336-4.patch


 We serialize POStore too early in JobControlCompiler. At that time, the 
 storeFunc has unconstrained links to other operators; in the worst case, it 
 chains in the whole physical plan. Also, in the multi-store case, POStore has 
 a link to its data source, which is not needed and increases the footprint 
 of the serialized POStore. 




[jira] Created: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Santhosh Srinivasan (JIRA)
PigStorage should be able to read back complex data containing delimiters 
created by PigStorage
---

 Key: PIG-1344
 URL: https://issues.apache.org/jira/browse/PIG-1344
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
 Fix For: 0.8.0


With Pig 0.7, the TextDataParser has been removed and the logic for parsing 
complex data types has moved to Utf8StorageConverter. However, this does not 
handle the case where complex data types contain the delimiter characters ('{', 
'}', ',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
self-contained and more usable.
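The round-tripping problem can be shown with a toy sketch (a delimiter-unaware parser in plain Python, not Pig's Utf8StorageConverter):

```python
def naive_parse_tuple(text):
    # Split on the field delimiter with no escaping or quoting -- roughly
    # what a delimiter-unaware text parser does.
    assert text.startswith("(") and text.endswith(")")
    return tuple(text[1:-1].split(","))

original = ("a,b", "c")                  # first field contains ','
stored = "(" + ",".join(original) + ")"  # serialized form: "(a,b,c)"
parsed = naive_parse_tuple(stored)       # three fields come back, not two
```

Once a field value contains a delimiter, the serialized text is ambiguous: the parser cannot tell a structural comma from a data comma, so the original two-field tuple reads back as three fields. Some form of escaping or quoting is needed to make the format self-contained.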




[jira] Commented: (PIG-1313) PigServer leaks memory over time

2010-03-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851717#action_12851717
 ] 

Hadoop QA commented on PIG-1313:


-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12440276/PIG-1313-3.patch
  against trunk revision 929236.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/272/console

This message is automatically generated.

 PigServer leaks memory over time
 

 Key: PIG-1313
 URL: https://issues.apache.org/jira/browse/PIG-1313
 Project: Pig
  Issue Type: Bug
Reporter: Bill Graham
Assignee: Bill Graham
 Attachments: PIG-1313-0.4.0-1.patch, PIG-1313-1.patch, 
 PIG-1313-1.patch, PIG-1313-2.patch, PIG-1313-3.patch, Pig1313Reproducer.java


 When {{PigServer}} runs it creates temporary files using the 
 {{FileLocalizer.getTemporaryPath(..)}}. This static method creates and 
 returns a handle to a temporary file (as an instance of 
 {{ElementDescriptor}}). The {{ElementDescriptors}} returned by this method 
 are kept on a static {{Stack}} named {{toDelete}}. The items on {{toDelete}} 
 get removed by the {{FileLocalizer.deleteTempFile()}} method.
 The only place in the code where I see {{FileLocalizer.deleteTempFile()}} 
 called is in the Main class. {{PigServer}} does not call that method though, 
 so a long-running VM that repeatedly uses instances of {{PigServer}} to run 
 jobs will leak memory via {{toDelete}}.
 One suggested fix is to have {{PigServer.shutdown()}} call 
 {{FileLocalizer.deleteTempFile()}}, but this would cause problems in a 
 multi-threaded environment, since it seems {{ElementDescriptors}} are pushed 
 onto the {{toDelete}} stack before they're used, not once they're done with. 
 With this approach, running multiple instances of {{PigServer}} in separate 
 threads could cause one completed job to clobber the other's still-in-use 
 temp files.
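A minimal model of the leak described above (plain Python; the class and method names mirror the ones in the report, but this is an illustration, not Pig's actual code):

```python
class FileLocalizer:
    # Static stack of temp-file handles: lives as long as the VM.
    to_delete = []

    @classmethod
    def get_temporary_path(cls, name):
        cls.to_delete.append(name)   # pushed before use, never popped
        return name

    @classmethod
    def delete_temp_files(cls):
        cls.to_delete.clear()        # only Main calls this, not PigServer

# A long-running VM that repeatedly uses PigServer to run jobs:
for job in range(1000):
    FileLocalizer.get_temporary_path(f"tmp-{job}")
```

After 1000 jobs the static stack holds 1000 entries: nothing in the PigServer path ever drains it, which is exactly the unbounded growth the report describes (and why simply calling the cleanup from shutdown() is unsafe when other threads' entries are still in use).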




[jira] Commented: (PIG-1344) PigStorage should be able to read back complex data containing delimiters created by PigStorage

2010-03-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12851729#action_12851729
 ] 

Daniel Dai commented on PIG-1344:
-

Hi Santhosh,
We changed complex data parsing in 0.7, and all values are now read as 
bytearray. The goal of this change was to stop guessing the datatype for 
complex data. You can cast to another datatype either implicitly or explicitly. 
So do you mean you still want datatype guessing in some cases?

 PigStorage should be able to read back complex data containing delimiters 
 created by PigStorage
 ---

 Key: PIG-1344
 URL: https://issues.apache.org/jira/browse/PIG-1344
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0
Reporter: Santhosh Srinivasan
Assignee: Daniel Dai
 Fix For: 0.8.0


 With Pig 0.7, the TextDataParser has been removed and the logic for parsing 
 complex data types has moved to Utf8StorageConverter. However, this does not 
 handle the case where complex data types contain the delimiter characters ('{', 
 '}', ',', '(', ')', '[', ']', '#'). Fixing this issue will make PigStorage 
 self-contained and more usable.




[jira] Updated: (PIG-1330) Move pruned schema tracking logic from LoadFunc to core code

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1330:


Attachment: PIG-1330-1.patch

 Move pruned schema tracking logic from LoadFunc to core code
 

 Key: PIG-1330
 URL: https://issues.apache.org/jira/browse/PIG-1330
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1330-1.patch


 Currently, LoadFunc.getSchema requires a schema after column pruning. The good 
 side of this is that LoadFunc.getSchema matches the data it actually loads, 
 which gives a sense of consistency. However, every LoadFunc then needs to 
 keep track of the pruned columns. This is an unnecessary burden on the 
 LoadFunc writer and is very error-prone. This issue is to move that 
 logic from LoadFunc into the Pig core; LoadFunc.getSchema then only needs to 
 return the original schema, even after pruning.
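The division of labor after this change can be sketched in a few lines (illustrative Python, not Pig's actual API):

```python
def apply_pruning(original_schema, kept_columns):
    # Core-side pruning: filter the loader's full schema down to the
    # column indexes the script actually uses. The LoadFunc no longer
    # tracks pruning itself.
    return [f for i, f in enumerate(original_schema) if i in kept_columns]

full_schema = ["col1", "col2", "col3"]   # what LoadFunc.getSchema reports
pruned = apply_pruning(full_schema, {0, 2})
```

The loader always reports its full schema; the core, which already knows which columns were pruned, derives the effective schema once, in one place.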




[jira] Updated: (PIG-1330) Move pruned schema tracking logic from LoadFunc to core code

2010-03-30 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1330:


Status: Patch Available  (was: Open)

 Move pruned schema tracking logic from LoadFunc to core code
 

 Key: PIG-1330
 URL: https://issues.apache.org/jira/browse/PIG-1330
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.7.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.7.0

 Attachments: PIG-1330-1.patch


 Currently, LoadFunc.getSchema requires a schema after column pruning. The good 
 side of this is that LoadFunc.getSchema matches the data it actually loads, 
 which gives a sense of consistency. However, every LoadFunc then needs to 
 keep track of the pruned columns. This is an unnecessary burden on the 
 LoadFunc writer and is very error-prone. This issue is to move that 
 logic from LoadFunc into the Pig core; LoadFunc.getSchema then only needs to 
 return the original schema, even after pruning.
