[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-07-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12893959#action_12893959
 ] 

Daniel Dai commented on PIG-1452:
-

Or we can include an exclusion list, excluding some known jars (such as 
jython.jar) from pig.jar.
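A hedged sketch of what such an exclusion list could look like in the Ant build (the excluded jar patterns here are examples only, not the actual patch):

```xml
<!-- Pack Ivy-resolved jars into pig.jar, but leave out known jars;
     the excludes patterns below are illustrative. -->
<zipgroupfileset dir="${ivy.lib.dir}" includes="*.jar"
                 excludes="jython*.jar,junit*.jar"/>
```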

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452V2.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that the hadoop-0.20.2 artifacts are available in the Maven repo, Pig 
 should leverage Ivy for resolving/retrieving the Hadoop artifacts.
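A minimal ivy.xml fragment sketching such a dependency (the exact org/module/conf mapping in the actual patch may differ):

```xml
<!-- Pull Hadoop from the Apache Maven repo instead of shipping
     lib/hadoop20.jar; coordinates shown are illustrative. -->
<dependency org="org.apache.hadoop" name="hadoop-core"
            rev="0.20.2" conf="compile->master"/>
```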

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

2010-07-30 Thread Gerrit Jansen van Vuuren (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1526:
--

Status: Patch Available  (was: Open)
  Tags: PIG-1526.patch

 HiveColumnarLoader Partitioning Support
 ---

 Key: PIG-1526
 URL: https://issues.apache.org/jira/browse/PIG-1526
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
 Fix For: 0.8.0

 Attachments: PIG-1526.patch


 I've made a lot of improvements to the HiveColumnarLoader:
 - Added support for LoadMetadata and data path partitioning 
 - Improved and simplified column loading
 Data Path Partitioning:
 Hive stores partitions as folders, e.g. 
 /mytable/partition1=[value]/partition2=[value]. That is, the table mytable 
 contains two partitions [partition1, partition2].
 The HiveColumnarLoader will scan the input path /mytable and add the columns 
 partition1 and partition2 to the PigSchema. 
 These columns can then be used in filtering. 
 For example: we've got year, month, day, hour partitions in our data uploads,
 so a table might look like mytable/year=2010/month=02/day=01.
 Loading with the HiveColumnarLoader allows our Pig scripts to filter by date 
 using the standard Pig Filter operator.
 I've added 2 classes for this:
 - PathPartitioner
 - PathPartitionHelper
 These classes are not Hive-dependent and could be used by any other loader 
 that wants to support partitioning; they also help with implementing the 
 LoadMetadata interface.
 For this reason I thought it best to put them into the package 
 org.apache.pig.piggybank.storage.partition.
 It would be nice if, in the future, PigStorage also used these two classes 
 to provide automatic path partitioning support. 
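To illustrate the idea (not the actual patch code), here is a hypothetical sketch of the key=value extraction a class like PathPartitioner might perform on a Hive-style path; the class and method names are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only: names are illustrative, not the API of the
// attached patch.
public class PathPartitionSketch {

    /** Extracts Hive-style key=value partition pairs from a path. */
    public static Map<String, String> parsePartitions(String path) {
        Map<String, String> partitions = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                partitions.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        // → {year=2010, month=02, day=01}
        System.out.println(parsePartitions("mytable/year=2010/month=02/day=01"));
    }
}
```

The extracted keys are what would be surfaced to the PigSchema as extra columns for filtering.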




[jira] Updated: (PIG-1526) HiveColumnarLoader Partitioning Support

2010-07-30 Thread Gerrit Jansen van Vuuren (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gerrit Jansen van Vuuren updated PIG-1526:
--

   Priority: Minor  (was: Major)
Description: 
I've made a lot of improvements to the HiveColumnarLoader:
- Added support for LoadMetadata and data path partitioning 
- Improved and simplified column loading

Data Path Partitioning:

Hive stores partitions as folders, e.g. 
/mytable/partition1=[value]/partition2=[value]. That is, the table mytable 
contains two partitions [partition1, partition2].
The HiveColumnarLoader will scan the input path /mytable and add the columns 
partition1 and partition2 to the PigSchema. 
These columns can then be used in filtering. 
For example: we've got year, month, day, hour partitions in our data uploads,
so a table might look like mytable/year=2010/month=02/day=01.
Loading with the HiveColumnarLoader allows our Pig scripts to filter by date 
using the standard Pig Filter operator.

I've added 2 classes for this:
- PathPartitioner
- PathPartitionHelper

These classes are not Hive-dependent and could be used by any other loader that 
wants to support partitioning; they also help with implementing the LoadMetadata 
interface.
For this reason I thought it best to put them into the package 
org.apache.pig.piggybank.storage.partition.
It would be nice if, in the future, PigStorage also used these two classes 
to provide automatic path partitioning support. 





 HiveColumnarLoader Partitioning Support
 ---

 Key: PIG-1526
 URL: https://issues.apache.org/jira/browse/PIG-1526
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1526.patch






[jira] Commented: (PIG-1343) pig_log file missing even though Main tells it is creating one and an M/R job fails

2010-07-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894075#action_12894075
 ] 

Daniel Dai commented on PIG-1343:
-

Hi Ashitosh,
As you know, Pig 0.8 will code freeze at the end of August. Are you able to 
finish the patch by mid-August? Thanks.

 pig_log file missing even though Main tells it is creating one and an M/R job 
 fails 
 

 Key: PIG-1343
 URL: https://issues.apache.org/jira/browse/PIG-1343
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.6.0
Reporter: Viraj Bhat
Assignee: Ashitosh Darbarwar
 Fix For: 0.8.0

 Attachments: PIG-1343-1.patch


 There is a particular case I ran into while running the latest trunk of Pig:
 {code}
 $java -cp pig.jar:/home/path/hadoop20cluster org.apache.pig.Main testcase.pig
 [main] INFO  org.apache.pig.Main - Logging error messages to: 
 /homes/viraj/pig_1263420012601.log
 $ls -l pig_1263420012601.log
 ls: pig_1263420012601.log: No such file or directory
 {code}
 The job failed and the log file did not contain anything; the only way to 
 debug was to look into the JobTracker logs.
 Here are some reasons that could have caused this behavior:
 1) The underlying filer/NFS had some issues. In that case, should we not 
 report an error on stdout?
 2) There are some errors from the backend which are not being captured.
 Viraj




[jira] Updated: (PIG-1513) Pig doesn't handle empty input directory

2010-07-30 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1513:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
  Resolution: Fixed

 Pig doesn't handle empty input directory
 

 Key: PIG-1513
 URL: https://issues.apache.org/jira/browse/PIG-1513
 Project: Pig
  Issue Type: Bug
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1513.patch


 The following script
 {code}
 A = load 'input';
 B = load 'emptydir';
 C = join B by $0, A by $0 using 'skewed';
 store C into 'output';
 {code}
 fails with ERROR: java.lang.RuntimeException: Empty samples file.
 In this case, the sample job has 0 maps. Pig doesn't expect this and fails. 
 The merge join script
 {code}
 A = load 'input';
 B = load 'emptydir';
 C = join A by $0, B by $0 using 'merge';
 store C into 'output';
 {code}
 also has 0 maps in the sample job, and the script fails with ERROR 2176: 
 Error processing right input during merge join.
 But if we change the join order: 
 {code}
 A = load 'input';
 B = load 'emptydir';
 C = join B by $0, A by $0 using 'merge';
 store C into 'output';
 {code}
 the second job (merge) now has 0 maps and 0 reduces, and it generates an 
 empty 'output' directory.
 Order by on an empty directory works fine and generates empty part files.




[jira] Commented: (PIG-1452) to remove hadoop20.jar from lib and use hadoop from the apache maven repo.

2010-07-30 Thread Konstantin Boudnik (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894077#action_12894077
 ] 

Konstantin Boudnik commented on PIG-1452:
-

Normally, it isn't a good idea to include any jars in a production jar 
file. If extra jars need to be delivered, they are better delivered 
separately, e.g. through Ivy dependencies. So I believe the build needs to be 
changed a bit, especially in this line:
{noformat}
+<zipgroupfileset dir="${ivy.lib.dir}" includes="*.jar" />
{noformat}
While I am not suggesting how Pig needs to deliver its artifacts, having 
dependency jars packed into the same gigantic jar file doesn't seem right. 

 to remove hadoop20.jar from lib and use hadoop from the apache maven repo.
 --

 Key: PIG-1452
 URL: https://issues.apache.org/jira/browse/PIG-1452
 Project: Pig
  Issue Type: Improvement
  Components: build
Affects Versions: 0.8.0
Reporter: Giridharan Kesavan
Assignee: Giridharan Kesavan
 Fix For: 0.8.0

 Attachments: PIG-1452.PATCH, PIG-1452V2.PATCH


 Pig uses Ivy for dependency management, but it still uses hadoop20.jar from 
 the lib folder. 
 Now that the hadoop-0.20.2 artifacts are available in the Maven repo, Pig 
 should leverage Ivy for resolving/retrieving the Hadoop artifacts.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-07-30 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12894205#action_12894205
 ] 

Yan Zhou commented on PIG-1518:
---

CombinedInputFormat, in lieu of the deprecated MultiFileInputFormat, batches 
small files on the basis of block locality. For Pig, this umbrella input format 
will have to work with generic input formats for which the block info is 
unavailable but the data node and size info are present to let the M/R 
framework make scheduling decisions. In other words, Pig cannot break the 
original splits apart; it can only use the original splits as building blocks 
for the combined input splits.

Consequently, this combined input format will hold multiple generic input 
splits, so that each combined split's size is bounded by a configured limit, 
say pig.maxsplitsize, defaulting to the HDFS block size of the file system 
the load source resides in.

However, due to the sortedness constraints on the tables in a merge join, 
split combination will not be used for any loads that feed a merge join. For 
map-side cogroup or map-side group-by, though, the splits can be combined, 
because each split is only required to contain all duplicate keys per 
instance, and combining splits preserves that invariant.

During combination, the splits on the same data nodes will be merged as much 
as possible. Leftovers will be merged without regard to data locality. Of all 
the used data nodes, those with fewer splits will be merged before those with 
more splits, so as to minimize the leftovers on the data nodes with fewer 
splits. On each data node, a greedy approach is adopted: the largest splits 
are merged first, because smaller splits are more easily merged later among 
themselves.
As a result, the implementation maintains a list of data hosts sorted on the 
number of splits, each holding a list of the original splits sorted on split 
size, to perform the above operations efficiently. The complexity should be 
linear in the number of the original splits.
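A simplified, locality-free sketch of the greedy size-bounded packing described above (the class and method names are hypothetical, and the real implementation additionally groups per data node):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch only: packs split sizes largest-first into
// combined splits bounded by a configured maximum (pig.maxsplitsize
// in the comment above); not the actual Pig implementation.
public class SplitCombinerSketch {

    /** Packs split sizes (largest first) into combined splits whose
     *  total size stays within maxSize; a split larger than maxSize
     *  ends up in a combined split of its own. */
    public static List<List<Long>> combine(List<Long> splitSizes, long maxSize) {
        List<Long> sorted = new ArrayList<>(splitSizes);
        sorted.sort(Collections.reverseOrder()); // largest splits first
        List<List<Long>> combined = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : sorted) {
            if (!current.isEmpty() && currentSize + size > maxSize) {
                combined.add(current); // close the current combined split
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            combined.add(current);
        }
        return combined;
    }

    public static void main(String[] args) {
        // → [[100], [60, 40], [30, 20, 10]]
        System.out.println(combine(List.of(100L, 60L, 40L, 30L, 20L, 10L), 128L));
    }
}
```

Note how the smallest splits end up merged together at the tail, matching the rationale that smaller splits are easier to merge later among themselves.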

Note that for data locality, we just honor whatever the generic input split's 
getLocations() method produces. Any particular input split implementation may 
or may not actually hold that property. For instance, CombinedInputFormat 
will combine node-local or rack-local blocks into a split. Essentially, this 
Pig container input split works with whatever data-locality perception the 
underlying loader provides.

On the implementation side, PigSplit will no longer hold a single wrapped 
InputSplit instance but a new CombinedInputSplit instance. Accordingly, 
PigRecordReader will hold a list of wrapped record readers instead of just a 
single one, and its nextKeyValue() will walk the wrapped record readers in 
order to fetch the next values.
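The "walk a list of wrapped readers" idea can be modeled with plain iterators (this is an illustrative sketch, not the actual PigRecordReader code):

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch only: drains a list of wrapped readers in order,
// as the proposed PigRecordReader.nextKeyValue() is described to do.
public class ChainedReaderSketch<T> {
    private final Iterator<Iterator<T>> readers;
    private Iterator<T> current;

    public ChainedReaderSketch(List<Iterator<T>> wrapped) {
        this.readers = wrapped.iterator();
        this.current = readers.hasNext() ? readers.next() : null;
    }

    /** Returns the next value, moving on to the next wrapped reader
     *  when the current one is exhausted; null once all are drained. */
    public T next() {
        while (current != null) {
            if (current.hasNext()) {
                return current.next();
            }
            current = readers.hasNext() ? readers.next() : null;
        }
        return null;
    }
}
```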

Risks include: 1) the test verifications may need major changes, since this 
optimization may cause major ordering changes in results; 2) since 
LoadFunc.prepareToRead() takes a PigSplit argument, there might be a backward 
compatibility issue as PigSplit changes its wrapped input split to the 
combined input split. But this should be very unlikely, as the only known 
use of the PigSplit argument is the internal index loader for the right 
table in a merge join.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultiFileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
