[jira] Commented: (PIG-882) log level not propogated to loggers

2009-08-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738972#action_12738972
 ] 

Hudson commented on PIG-882:


Integrated in Pig-trunk #512 (See 
[http://hudson.zones.apache.org/hudson/job/Pig-trunk/512/])
: log level not propogated to loggers


 log level not propogated to loggers 
 

 Key: PIG-882
 URL: https://issues.apache.org/jira/browse/PIG-882
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.3.0
Reporter: Thejas M Nair
Assignee: Daniel Dai
 Fix For: 0.4.0

 Attachments: PIG-882-1.patch, PIG-882-2.patch, PIG-882-3.patch, 
 PIG-882-4.patch, PIG-882-5.patch


 Pig accepts a log level as a parameter. But the level it captures is not 
 applied, so loggers in different classes do not log at the specified 
 level.
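The mechanism at stake, propagating a single requested level so that loggers created in other classes honor it, can be sketched with plain java.util.logging (package names here are hypothetical; Pig itself uses commons-logging/log4j):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogLevelDemo {
    // Hold a strong reference: java.util.logging keeps loggers weakly,
    // so the configured parent must not be garbage collected.
    static final Logger PARENT = Logger.getLogger("org.example.pig");

    // Setting the level on a common parent logger is what makes loggers
    // created in other classes (children in the dotted namespace) honor it.
    static void applyLevel(Level level) {
        PARENT.setLevel(level);
    }

    public static void main(String[] args) {
        applyLevel(Level.FINE);
        Logger child = Logger.getLogger("org.example.pig.SomeClass");
        // The child has no explicit level, so its effective level
        // is inherited from the parent.
        System.out.println(child.isLoggable(Level.FINE)); // true
    }
}
```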

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



RE: Is it possible to access Configuration in UDF ?

2009-08-04 Thread Olga Natkovich
At the moment we can't make UDFs dependent on Hadoop, as people also use
them for testing in local mode, which is currently not based on Hadoop
local mode due to performance constraints.

I agree that we need to provide a way to get a
configuration/property object to UDFs.

Olga

-Original Message-
From: Daniel Dai [mailto:dai...@gmail.com] 
Sent: Monday, August 03, 2009 9:20 PM
To: pig-dev@hadoop.apache.org; pig-u...@hadoop.apache.org
Subject: Re: Is it possible to access Configuration in UDF ?

Hi, Jeff,
This is not an API at all; it is a hack to make things work. We do lack 
a couple of features for UDFs:
1. reporter and counter (PIG-889)
2. access to global properties
3. ability to maintain state across different UDF invocations
4. input schema
5. variable-length arguments (PIG-902)

Your suggestion sounds reasonable. We need to provide a well-designed 
interface for these features.
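For context, the base-class idea under discussion can be sketched without any Hadoop dependency. All names below are hypothetical stand-ins, not Pig's actual classes:

```java
import java.util.Properties;

public class UdfConfigDemo {
    // Stand-in for the globally visible job configuration (hypothetical,
    // analogous in spirit to PigMapReduce.sJobConf).
    static final Properties JOB_CONF = new Properties();

    // Base class: every UDF sees the configuration through one field.
    static abstract class EvalFuncBase<T> {
        protected final Properties conf = JOB_CONF;
        abstract T exec(String input);
    }

    // A UDF customized via a property rather than a constructor argument.
    static class SuffixFunc extends EvalFuncBase<String> {
        String exec(String input) {
            return input + conf.getProperty("suffix", "");
        }
    }

    public static void main(String[] args) {
        JOB_CONF.setProperty("suffix", "!");
        System.out.println(new SuffixFunc().exec("pig")); // pig!
    }
}
```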

- Original Message - 
From: zhang jianfeng zjf...@gmail.com
To: pig-u...@hadoop.apache.org; pig-dev@hadoop.apache.org
Sent: Monday, August 03, 2009 8:03 PM
Subject: Re: Is it possible to access Configuration in UDF ?


 Dmitriy,

 Thank you for your help.

 I find this way of using the API not very intuitive; I recommend that the
base
 class of UDF implement the Configurable interface.
 Then each UDF can use getConf() to get the Configuration object.
 Because a UDF is part of MapReduce, it makes sense to make it
Configurable.

 The following is the change I recommend to EvalFunc:

 public abstract class EvalFunc<T> implements Configurable {
     ..
     protected Configuration conf;
     ..
     public EvalFunc() {
         conf = PigMapReduce.sJobConf;
     }
     ..
     @Override
     public void setConf(Configuration conf) {
         this.conf = conf;
     }

     @Override
     public Configuration getConf() {
         return this.conf;
     }
 }




 Jeff Zhang





 On Mon, Aug 3, 2009 at 8:52 PM, Dmitriy Ryaboy 
 dvrya...@cloudera.com wrote:

 You can access the JobConf with the following call:

 ConfigurationUtil.toProperties(PigMapReduce.sJobConf)

 On Mon, Aug 3, 2009 at 12:40 AM, zhang jianfeng zjf...@gmail.com
wrote:
  Hi all,
 
  I'd like to set property in Configuration to customize my UDF. But
it
 looks
  like I can not access the Configuration object in UDF.
 
  Does pig have a plan to support this feature ?
 
 
  Thank you.
 
  Jeff Zhang
 

 



[jira] Created: (PIG-905) TOKENIZE throws exception on null data

2009-08-04 Thread Olga Natkovich (JIRA)
TOKENIZE throws exception on null data
--

 Key: PIG-905
 URL: https://issues.apache.org/jira/browse/PIG-905
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich


it should just return null




[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-04 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-901:
---

Attachment: PIG-901-trunk.patch

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, 
 PIG-901-trunk.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Updated: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-04 Thread Pradeep Kamath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Kamath updated PIG-901:
---

Status: Patch Available  (was: Open)

PIG-901-trunk.patch is for the trunk. The change is in SliceWrapper to 
serialize ExecType only instead of PigContext since only the ExecType from the 
PigContext is used on deserialization. The package import list which Daniel 
referred to is a static member of PigContext which is explicitly set in 
SliceWrapper.makeRecordReader() and hence is taken care of.

It is a good suggestion to include a test case to check that even with a 
sizeable PigContext we actually create small input splits. However, doing this 
in the current Pig code layout means opening up PigServer and 
JobControlCompiler so that we can compile a pig script up to job creation and 
then, instead of submitting the job to hadoop, instantiate PigInputFormat with 
the jobConf and get the input splits. This may require some design changes 
which we should address at some point for these kinds of tests. For now there 
is a regression test in the patch to ensure the package import list is correctly 
handled, and we have manually tested to ensure the split size is small (on the 
order of KBs).
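The size argument can be illustrated with a self-contained Java serialization sketch. The class names are hypothetical stand-ins for PigContext and ExecType, not Pig's actual types:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class SplitSizeDemo {
    enum ExecMode { LOCAL, MAPREDUCE }   // stand-in for ExecType

    // Stand-in for a heavyweight context object (hypothetical payload).
    static class BigContext implements Serializable {
        byte[] payload = new byte[64 * 1024]; // imagine packages, scripts, etc.
        ExecMode mode = ExecMode.MAPREDUCE;
    }

    static byte[] serialize(Serializable o) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        // Serializing the whole context drags the payload into every split;
        // serializing just the enum costs on the order of a hundred bytes.
        System.out.println(serialize(new BigContext()).length > 64 * 1024); // true
        System.out.println(serialize(ExecMode.MAPREDUCE).length < 200);     // true
    }
}
```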

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, 
 PIG-901-trunk.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-04 Thread Arun C Murthy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739048#action_12739048
 ] 

Arun C Murthy commented on PIG-901:
---

bq. This may require some design changes which we should address at some point 
for these kinds of tests.

Could you please track this with a new jira? Thanks!

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, 
 PIG-901-trunk.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Created: (PIG-906) Need a way to test integration points with Hadoop from unit tests

2009-08-04 Thread Pradeep Kamath (JIRA)
Need a way to test integration points with Hadoop from unit tests
-

 Key: PIG-906
 URL: https://issues.apache.org/jira/browse/PIG-906
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Priority: Minor


Currently there is no easy mechanism for a unit test to get hold of the 
compiled JobConf (or Job) for a script. This may require some design changes, 
like having public methods in PigServer and JobControlCompiler, to be able to 
compile a script up to launch and then get hold of the JobConf or Job to ensure 
things are set up right. The need for this showed up in PIG-901 as described in 
https://issues.apache.org/jira/browse/PIG-901?focusedCommentId=12739044&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12739044.
 That use case can be used as one of the requirements for the design change.




[jira] Commented: (PIG-901) InputSplit (SliceWrapper) created by Pig is big in size due to serialized PigContext

2009-08-04 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739056#action_12739056
 ] 

Pradeep Kamath commented on PIG-901:


https://issues.apache.org/jira/browse/PIG-906 has been created to track changes 
to enable unit testing these types of hadoop integration scenarios.

 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext
 

 Key: PIG-901
 URL: https://issues.apache.org/jira/browse/PIG-901
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.3.1
Reporter: Pradeep Kamath
Assignee: Pradeep Kamath
 Fix For: 0.4.0

 Attachments: PIG-901-1.patch, PIG-901-branch-0.3.patch, 
 PIG-901-trunk.patch


 InputSplit (SliceWrapper) created by Pig is big in size due to serialized 
 PigContext. SliceWrapper only needs ExecType - so the entire PigContext 
 should not be serialized and only the ExecType should be serialized.




[jira] Updated: (PIG-907) Provide multiple version of HashFNV (Piggybank)

2009-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-907:
---

Attachment: PIG-907-1.patch

 Provide multiple version of HashFNV (Piggybank)
 ---

 Key: PIG-907
 URL: https://issues.apache.org/jira/browse/PIG-907
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-907-1.patch


 HashFNV takes 1 or 2 parameters. While PIG-902 is not solved, it is better to 
 create 2 versions of HashFNV so that Pig can pick the right version and do 
 the type cast. Otherwise, users have to do an explicit cast. 
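For reference, the hash family in question can be sketched as a generic 32-bit FNV-1a. This is a sketch with the standard published FNV constants; the Piggybank HashFNV UDF may use a different FNV variant or bit width:

```java
public class HashFnv {
    // Generic 32-bit FNV-1a over a byte sequence.
    static int fnv1a32(byte[] data) {
        int h = 0x811c9dc5;          // FNV-1a 32-bit offset basis
        for (byte b : data) {
            h ^= (b & 0xff);         // xor in the next octet
            h *= 0x01000193;         // multiply by the 32-bit FNV prime
        }
        return h;
    }

    public static void main(String[] args) {
        // Standard FNV-1a test vector: "a" hashes to e40c292c.
        System.out.printf("%08x%n", fnv1a32("a".getBytes()));
    }
}
```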




[jira] Created: (PIG-908) Need a way to correlate MR jobs with Pig statements

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)
Need a way to correlate MR jobs with Pig statements
---

 Key: PIG-908
 URL: https://issues.apache.org/jira/browse/PIG-908
 Project: Pig
  Issue Type: Wish
Reporter: Dmitriy V. Ryaboy


Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
recent introduction of multi-store capabilities.
For example, the first script in the Pig tutorial produces 5 MR jobs.

There is currently very little support for debugging resulting jobs; if one of 
the MR jobs fails, it is hard to figure out which part of the script it was 
responsible for. Explain plans help, but even with the explain plan, a fair 
amount of effort (and sometimes, experimentation) is required to correlate the 
failing MR job with the corresponding PigLatin statements.

This ticket is created to discuss approaches to alleviating this problem.




[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739125#action_12739125
 ] 

Dmitriy V. Ryaboy commented on PIG-908:
---

An idea for something that might work (I haven't evaluated the complexity of 
implementing it):

When LogicalOperators are created, a bit of metadata is attached to them, 
listing the line numbers they come from.  Multiple LOs may be created from 
a single line, and multiple lines may be associated with a single operator. 

This metadata is passed down to Physical Operators.

When an MR job is created, a log message is written listing the line numbers 
that are associated with the POs in this map-reduce job, and the job name.
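The proposal can be sketched in miniature with plain Java (all names are hypothetical, not Pig's operator classes):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class LineageDemo {
    // Each operator carries the script line numbers it came from.
    static class Op {
        final Set<Integer> lines = new TreeSet<>();
        Op(Integer... ls) { lines.addAll(Arrays.asList(ls)); }
    }

    // A job logs the union of the lines of the operators it contains.
    static Set<Integer> jobLines(List<Op> ops) {
        Set<Integer> all = new TreeSet<>();
        for (Op op : ops) all.addAll(op.lines);
        return all;
    }

    public static void main(String[] args) {
        List<Op> mrJob = Arrays.asList(new Op(3), new Op(3, 4), new Op(7));
        System.out.println("job_1 built from script lines " + jobLines(mrJob));
        // prints: job_1 built from script lines [3, 4, 7]
    }
}
```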

Thoughts?

 Need a way to correlate MR jobs with Pig statements
 ---

 Key: PIG-908
 URL: https://issues.apache.org/jira/browse/PIG-908
 Project: Pig
  Issue Type: Wish
Reporter: Dmitriy V. Ryaboy

 Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
 recent introduction of multi-store capabilities.
 For example, the first script in the Pig tutorial produces 5 MR jobs.
 There is currently very little support for debugging resulting jobs; if one 
 of the MR jobs fails, it is hard to figure out which part of the script it 
 was responsible for. Explain plans help, but even with the explain plan, a 
 fair amount of effort (and sometimes, experimentation) is required to 
 correlate the failing MR job with the corresponding PigLatin statements.
 This ticket is created to discuss approaches to alleviating this problem.




[jira] Commented: (PIG-908) Need a way to correlate MR jobs with Pig statements

2009-08-04 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739147#action_12739147
 ] 

Santhosh Srinivasan commented on PIG-908:
-

+1

This approach has been discussed but not documented.

 Need a way to correlate MR jobs with Pig statements
 ---

 Key: PIG-908
 URL: https://issues.apache.org/jira/browse/PIG-908
 Project: Pig
  Issue Type: Wish
Reporter: Dmitriy V. Ryaboy

 Complex Pig Scripts often generate many Map-Reduce jobs, especially with the 
 recent introduction of multi-store capabilities.
 For example, the first script in the Pig tutorial produces 5 MR jobs.
 There is currently very little support for debugging resulting jobs; if one 
 of the MR jobs fails, it is hard to figure out which part of the script it 
 was responsible for. Explain plans help, but even with the explain plan, a 
 fair amount of effort (and sometimes, experimentation) is required to 
 correlate the failing MR job with the corresponding PigLatin statements.
 This ticket is created to discuss approaches to alleviating this problem.




[jira] Updated: (PIG-907) Provide multiple version of HashFNV (Piggybank)

2009-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-907:
---

Status: Patch Available  (was: Open)

 Provide multiple version of HashFNV (Piggybank)
 ---

 Key: PIG-907
 URL: https://issues.apache.org/jira/browse/PIG-907
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-907-1.patch


 HashFNV takes 1 or 2 parameters. While PIG-902 is not solved, it is better to 
 create 2 versions of HashFNV so that Pig can pick the right version and do 
 the type cast. Otherwise, users have to do an explicit cast. 




[jira] Updated: (PIG-907) Provide multiple version of HashFNV (Piggybank)

2009-08-04 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-907:
---

Attachment: PIG-907-2.patch

Changed the patch to include the license header and more robust error handling. 
Thanks to Thejas for pointing this out.

 Provide multiple version of HashFNV (Piggybank)
 ---

 Key: PIG-907
 URL: https://issues.apache.org/jira/browse/PIG-907
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.3.0
Reporter: Daniel Dai
Priority: Minor
 Fix For: 0.4.0

 Attachments: PIG-907-1.patch, PIG-907-2.patch


 HashFNV takes 1 or 2 parameters. While PIG-902 is not solved, it is better to 
 create 2 versions of HashFNV so that Pig can pick the right version and do 
 the type cast. Otherwise, users have to do an explicit cast. 




[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-909:
--

Attachment: pig_909.patch

The attached patch modifies bin/pig as described.

Tested locally by setting and unsetting HADOOP_HOME and making sure the right 
configurations, etc, are picked up.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739244#action_12739244
 ] 

Daniel Dai commented on PIG-909:


It seems bin/pig has been broken for a while. Some libraries have been moved to 
build/ivy/lib/Pig, and the pig script does not handle this correctly.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-909:
--

Attachment: pig_909.2.patch

added ivy jars to classpath

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.2.patch, pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739269#action_12739269
 ] 

Daniel Dai commented on PIG-909:


Hi, Dmitriy,
One problem is that the hadoop.jar that comes with pig actually bundles lots of 
external libraries needed by hadoop, such as log4j and commons-logging. If we 
skip hadoop.jar and use an external one, we miss all those libraries. Can we 
try this: if we have an external hadoop.jar, put it in front of pig.jar in the 
classpath, so java will pick up classes in the external hadoop.jar first.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.2.patch, pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739282#action_12739282
 ] 

Daniel Dai commented on PIG-909:


Yes, Dmitriy, you said it. However, if we do not have an external hadoop, the 
pig script currently does not work. We need to fix it.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.2.patch, pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739287#action_12739287
 ] 

Dmitriy V. Ryaboy commented on PIG-909:
---

Daniel, not sure what you mean.
Do you mean that the patch makes it necessary to have an external version of 
hadoop to build/run pig?
That's not the case, as I wrapped the whole thing in an if -- external hadoop 
jars will only be used instead of the bundled hadoop.jar if HADOOP_HOME is 
defined (and valid).

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.2.patch, pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739292#action_12739292
 ] 

Daniel Dai commented on PIG-909:


Hi, Dmitriy, 
It is not related to the patch. What I mean is that the pig script in trunk is 
not working correctly even before the patch.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.2.patch, pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Commented: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739297#action_12739297
 ] 

Dmitriy V. Ryaboy commented on PIG-909:
---

Actually, I looked at build.xml for pig, and it includes the Ivy dependencies 
in pig.jar, which explains why this has been working for me.

I'll delete the second patch -- that change is unnecessary.

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Updated: (PIG-909) Allow Pig executable to use hadoop jars not bundled with pig

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-909:
--

Attachment: (was: pig_909.2.patch)

 Allow Pig executable to use hadoop jars not bundled with pig
 

 Key: PIG-909
 URL: https://issues.apache.org/jira/browse/PIG-909
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
Priority: Minor
 Attachments: pig_909.patch


 The current pig executable (bin/pig) looks for a file named 
 hadoop${PIG_HADOOP_VERSION}.jar that comes bundled with Pig.
 The proposed change will allow Pig to look in $HADOOP_HOME for the hadoop 
 jars, if that variable is set.




[jira] Updated: (PIG-660) Integration with Hadoop 0.20

2009-08-04 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-660:
--

Attachment: pig_660_shims.patch

Attached patch, pig_660_shims.patch, introduces a compatibility layer similar 
to that in https://issues.apache.org/jira/browse/HIVE-487 . HadoopShims.java 
contains wrappers that hide interface differences between Hadoop 18 and 20; 
when an interface change affects Pig, a shim is added into this class, and used 
by Pig.

Separate versions of the shims are maintained for different Hadoop versions.

This way, Pig users can compile against either Hadoop 18 or Hadoop 20 by simply 
changing an ant property, either via the -D flag, or build.properties, instead 
of having to go through the process of patching.

There has been discussion of officially moving Pig to 0.20; this way, we 
sidestep the whole question, and only need to worry about version compatibility 
when using specific Hadoop APIs.

I propose that we use this mechanism until Pig is moved to use the new, 
future-proofed API.  

Pig compiled against 18 won't be able to use some of the newest features, such 
as Zebra storage. Ant can be configured not to build Zebra if the Hadoop 
version is < 20.
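The shim idea can be sketched in miniature. The interface, classes, and configuration keys below are illustrative, not the actual contents of HadoopShims.java:

```java
import java.util.HashMap;
import java.util.Map;

public class ShimDemo {
    // One small interface hides version differences; each supported Hadoop
    // version gets its own implementation, selected at build time.
    interface HadoopShims {
        String jobTracker(Map<String, String> conf);
    }

    // Pre-0.20 style lookup (key name real; wiring hypothetical).
    static class Shims18 implements HadoopShims {
        public String jobTracker(Map<String, String> conf) {
            return conf.getOrDefault("mapred.job.tracker", "local");
        }
    }

    // Newer-style lookup behind the same interface (key illustrative).
    static class Shims20 implements HadoopShims {
        public String jobTracker(Map<String, String> conf) {
            return conf.getOrDefault("mapreduce.jobtracker.address", "local");
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapred.job.tracker", "jt:9001");
        // The rest of Pig only ever sees the interface.
        HadoopShims shims = new Shims18();
        System.out.println(shims.jobTracker(conf)); // jt:9001
    }
}
```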


 Integration with Hadoop 0.20
 

 Key: PIG-660
 URL: https://issues.apache.org/jira/browse/PIG-660
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.2.0
 Environment: Hadoop 0.20
Reporter: Santhosh Srinivasan
Assignee: Santhosh Srinivasan
 Fix For: 0.4.0

 Attachments: PIG-660-for-branch-0.3.patch, PIG-660.patch, 
 PIG-660_1.patch, PIG-660_2.patch, PIG-660_3.patch, PIG-660_4.patch, 
 PIG-660_5.patch, pig_660_shims.patch


 With Hadoop 0.20, it will be possible to query the status of each map and 
 reduce in a map reduce job. This will allow better error reporting. Some of 
 the other items that could be on Hadoop's feature requests/bugs are 
 documented here for tracking.
 1. Hadoop should return objects instead of strings when exceptions are thrown
 2. The JobControl should handle all exceptions and report them appropriately. 
 For example, when the JobControl fails to launch jobs, it should handle 
 exceptions appropriately and should support APIs that query this state, i.e., 
 failure to launch jobs.




[jira] Commented: (PIG-905) TOKENIZE throws exception on null data

2009-08-04 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739320#action_12739320
 ] 

Jeff Zhang commented on PIG-905:


I find that TOKENIZE cannot handle DataByteArray; it can only handle String. I 
believe it would be better to handle both DataByteArray and String. In my 
opinion, whenever a UDF supports one of them it should support both, because 
they are almost the same except that DataByteArray is Comparable and 
Serializable.



 TOKENIZE throws exception on null data
 --

 Key: PIG-905
 URL: https://issues.apache.org/jira/browse/PIG-905
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich

 it should just return null
