[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897368#action_12897368
 ] 

Alan Gates commented on PIG-1518:
-

bq. For mapside cogroup or mapside group by, though, the splits can be combined 
because the splits are only required to contain all the duplicate keys per 
instance, and a combination of splits will still preserve that invariant.

You are correct for mapside group, but not mapside cogroup.  Mapside cogroup 
does require all files being grouped to be processed in an ordered fashion.  

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing something similar: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.
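
To illustrate the idea of putting several small files into one split, here is a minimal sketch of a greedy packer. It is not the Pig or Hadoop implementation; the class name SplitPacker, the method pack, and the maxSplitBytes parameter are hypothetical, and only file sizes are considered (locality and ordering constraints such as those of mapside cogroup are ignored).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: groups files into combined splits by size. */
public class SplitPacker {
    /**
     * Greedily packs files, in input order, into splits whose total size
     * does not exceed maxSplitBytes. A file larger than the limit gets a
     * split of its own. Each inner list holds indexes into fileSizes.
     */
    public static List<List<Integer>> pack(long[] fileSizes, long maxSplitBytes) {
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentBytes = 0;
        for (int i = 0; i < fileSizes.length; i++) {
            // Close the current split if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + fileSizes[i] > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(i);
            currentBytes += fileSizes[i];
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }
}
```

With one map per split, the number of maps then depends on the total input size rather than on the number of files.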

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2010-08-11 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-965:
--

Status: Patch Available  (was: Reopened)

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Fix For: 0.8.0

 Attachments: automaton.jar, poregex2.patch


 Some frequently seen use cases of the 'matches' comparison operator have the 
 following properties -
 1. The rhs is a constant string, e.g. c1 matches 'abc%'
 2. Regexes that look for a matching prefix, suffix, etc. are very common, 
 e.g. 'abc%', '%abc', '%abc%'
 To optimize for these common cases, PORegex.java can be changed to -
 1. Compile the pattern (rhs of matches) once and re-use it if the pattern 
 string has not changed.
 2. Use string comparisons for simple common regexes (as in 2 above).
 The implementation of Hive's LIKE clause uses similar optimizations.
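
The two optimizations could look roughly like the sketch below. It is not the attached patch or the actual PORegex code; the class CachedMatcher and its members are hypothetical. It assumes the wildcard is written in Java regex form as `.*` (the `%` in the description is treated as shorthand for it), and it falls back to the compiled regex for any pattern or input where a plain string comparison might give a different answer.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch of the two optimizations; not the actual PORegex code. */
public class CachedMatcher {
    private enum Kind { EQUALS, PREFIX, SUFFIX, CONTAINS, REGEX }

    private String lastPattern;  // pattern string seen on the previous call
    private Kind kind;           // how lastPattern is evaluated
    private String core;         // literal text used by the string fast paths
    private Pattern compiled;    // compiled regex, used as the fallback

    /** True if s contains no regex metacharacters. */
    private static boolean isLiteral(String s) {
        for (int i = 0; i < s.length(); i++) {
            if ("\\[](){}.*+?^$|".indexOf(s.charAt(i)) >= 0) return false;
        }
        return true;
    }

    /** '.' does not match line terminators, so such inputs go to the regex. */
    private static boolean hasLineTerminator(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\n' || c == '\r' || c == 0x85 || c == 0x2028 || c == 0x2029) return true;
        }
        return false;
    }

    /** Compiles and classifies a pattern; called only when the pattern changes. */
    private void prepare(String p) {
        compiled = Pattern.compile(p);  // may throw before any other state changes
        int start = p.startsWith(".*") ? 2 : 0;
        boolean trail = p.length() - start >= 2 && p.endsWith(".*");
        int end = trail ? p.length() - 2 : p.length();
        core = p.substring(start, end);
        if (!isLiteral(core)) kind = Kind.REGEX;
        else if (start > 0 && trail) kind = Kind.CONTAINS;
        else if (start > 0) kind = Kind.SUFFIX;
        else if (trail) kind = Kind.PREFIX;
        else kind = Kind.EQUALS;
        lastPattern = p;
    }

    public boolean matches(String input, String pattern) {
        if (!pattern.equals(lastPattern)) prepare(pattern);
        if (kind == Kind.REGEX || hasLineTerminator(input)) {
            return compiled.matcher(input).matches();
        }
        switch (kind) {
            case PREFIX:   return input.startsWith(core);
            case SUFFIX:   return input.endsWith(core);
            case CONTAINS: return input.contains(core);
            default:       return input.equals(core);
        }
    }
}
```

Because the rhs is usually a constant, the pattern is classified and compiled once, and each later call is a single string comparison.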




[jira] Updated: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2010-08-11 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-965:
--

Fix Version/s: 0.8.0

Ankit is right, the patch is not present in trunk. I will apply it to trunk.

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
 Fix For: 0.8.0

 Attachments: automaton.jar, poregex2.patch





[jira] Assigned: (PIG-965) PERFORMANCE: optimize common case in matches (PORegex)

2010-08-11 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair reassigned PIG-965:
-

Assignee: Ankit Modi

 PERFORMANCE: optimize common case in matches (PORegex)
 --

 Key: PIG-965
 URL: https://issues.apache.org/jira/browse/PIG-965
 Project: Pig
  Issue Type: Improvement
  Components: impl
Reporter: Thejas M Nair
Assignee: Ankit Modi
 Fix For: 0.8.0

 Attachments: automaton.jar, poregex2.patch





[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897451#action_12897451
 ] 

Richard Ding commented on PIG-1458:
---

The proposal is to run another map-reduce job to merge the small files before 
the replicated join. This additional job will be added to the MR plan at 
compile time.

We consider three cases of a replicated join: 

# The right input is a map-only job and the input files exist at compile time.
# The right input is a map-only job and the input files do not exist at compile 
time.
# The right input is a map-reduce job.

For 1., if the number of files exceeds the threshold specified in the property 
file (_pig.frjoin.merge.files.threshold_), a merge job is added between the 
right input job and the FR join job.

For 3., if the number of reducers exceeds the threshold specified in the 
property file (_pig.frjoin.merge.files.threshold_), a merge job is added 
between the right input job and the FR join job.

For 2., if the flag specified in the property file 
(_pig.frjoin.merge.files.optimistic_) is false, a merge job is added between 
the right input job and the FR join job. The default value of this flag is false. 
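
The three cases can be sketched as a single decision function. This is only an illustration of the rules stated above, not the actual compiler change; the class FrJoinMergePlanner, the enum RightInput, and the method needsMergeJob are hypothetical names, and the two property values are passed in as plain arguments.

```java
/** Hypothetical decision helper; not the actual Pig compiler code. */
public class FrJoinMergePlanner {
    /** The three kinds of right input described in the comment. */
    public enum RightInput {
        MAP_ONLY_FILES_EXIST,    // case 1
        MAP_ONLY_FILES_MISSING,  // case 2
        MAP_REDUCE               // case 3
    }

    /**
     * @param kind       which of the three cases applies
     * @param count      number of input files (case 1) or reducers (case 3);
     *                   ignored in case 2
     * @param threshold  value of pig.frjoin.merge.files.threshold
     * @param optimistic value of pig.frjoin.merge.files.optimistic
     * @return true if a merge job should be added before the FR join job
     */
    public static boolean needsMergeJob(RightInput kind, int count,
                                        int threshold, boolean optimistic) {
        switch (kind) {
            case MAP_ONLY_FILES_EXIST:
            case MAP_REDUCE:
                return count > threshold;
            case MAP_ONLY_FILES_MISSING:
                // The file count is unknown at compile time, so merge
                // unless the user opted into the optimistic setting.
                return !optimistic;
            default:
                throw new IllegalArgumentException("unknown input kind: " + kind);
        }
    }
}
```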



 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0


 We have noticed that if the smaller data in a replicated join has many files, 
 this puts an unneeded burden on the name node. Pre-aggregating the files can 
 improve the situation.




[jira] Updated: (PIG-103) Shared Job /tmp location should be configurable

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-103:
-

Tags: documentation

 Shared Job /tmp location should be configurable
 ---

 Key: PIG-103
 URL: https://issues.apache.org/jira/browse/PIG-103
 Project: Pig
  Issue Type: Improvement
  Components: impl
 Environment: Partially shared file:// filesystem (eg NFS)
Reporter: Craig Macdonald
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: conf_tmp_dir.patch, conf_tmp_dir_2.patch


 Hello,
 I'm investigating running Pig in an environment where various parts of the 
 file:// filesystem are available on all nodes. I can tell Hadoop to use a 
 file:// file system location as its default by setting 
 fs.default.name=file://path/to/shared/folder
 However, this creates issues for Pig, as Pig writes its job information in a 
 folder that it assumes is on a shared FS (e.g. DFS). In this scenario, 
 /tmp is not shared across machines.
 So either /tmp should be configurable, or Hadoop should tell you the actual 
 full location set in fs.default.name?
 A straightforward solution is to make /tmp/ a property in 
 src/org/apache/pig/impl/io/FileLocalizer.java init(PigContext)
 Any suggestions for property names?
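
As a rough illustration of the suggestion, the temporary location could be read from a property with /tmp as the fallback. This is a sketch, not the attached patch or the FileLocalizer code; the class TempDirConfig is hypothetical, and the property name "pig.temp.dir" is only an assumed placeholder since the issue leaves the name open.

```java
import java.util.Properties;

/** Hypothetical sketch; not the actual FileLocalizer code. */
public class TempDirConfig {
    /** Assumed property name, used here only for illustration. */
    public static final String TEMP_DIR_KEY = "pig.temp.dir";

    /** Returns the configured temporary directory, or /tmp if none is set. */
    public static String tempDir(Properties props) {
        return props.getProperty(TEMP_DIR_KEY, "/tmp");
    }
}
```

Keeping /tmp as the default leaves existing deployments unchanged, while a site with a partially shared file:// filesystem can point the property at a shared folder.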




[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897455#action_12897455
 ] 

Thejas M Nair commented on PIG-1501:


Why was TFile chosen over SequenceFile? I am wondering if the additional, 
unused features of TFile (index, metadata) result in any overhead compared to 
SequenceFile. 


 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch


 We would like to understand how compressing map results as well as 
 reducer output in a chain of MR jobs impacts performance. We can use PigMix 
 queries for this investigation.




[jira] Commented: (PIG-1458) aggregate files for replicated join

2010-08-11 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897484#action_12897484
 ] 

Richard Ding commented on PIG-1458:
---

For 1. and 2. above, another approach is to do nothing and rely on 
MultiFileInputFormat (PIG-1518) to merge small files. 

 aggregate files for replicated join
 ---

 Key: PIG-1458
 URL: https://issues.apache.org/jira/browse/PIG-1458
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0





[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897493#action_12897493
 ] 

Yan Zhou commented on PIG-1518:
---

Right, map-side cogroup needs the input to be sorted, but only the side 
inputs need to be able to seek on a key; the base input only needs all 
duplicate keys to be present in a single mapper. I'll mark the side 
inputs as non-combinable.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0





[jira] Commented: (PIG-1501) need to investigate the impact of compression on pig performance

2010-08-11 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897496#action_12897496
 ] 

Yan Zhou commented on PIG-1501:
---

Please refer to HADOOP-3315 for an overall SequenceFile vs. TFile comparison. It 
appears that, for compressed data, TFile performs better than SequenceFile.

 need to investigate the impact of compression on pig performance
 

 Key: PIG-1501
 URL: https://issues.apache.org/jira/browse/PIG-1501
 Project: Pig
  Issue Type: Test
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: compress_perf_data.txt, compress_perf_data_2.txt, 
 PIG-1501.patch





[jira] Created: (PIG-1542) log level not propagated to MR task loggers

2010-08-11 Thread Thejas M Nair (JIRA)

log level not propagated to MR task loggers
---

 Key: PIG-1542
 URL: https://issues.apache.org/jira/browse/PIG-1542
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
 Fix For: 0.8.0


Specifying -d DEBUG does not affect the logging of the MR tasks.
This was fixed earlier in PIG-882.




[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-11 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1448:
---

Attachment: PIG-1448.1.patch

 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0, 0.5.0, 0.6.0, 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch


 This is a follow-up on PIG-1446, which only addresses this general problem for 
 a specific instance of For Each. In general, all the physical operators that 
 can have inner plans are vulnerable to this. A few of them are 
 POLocalRearrange, POFilter, POCollectedGroup, etc. All of these need to be fixed.




[jira] Updated: (PIG-1448) Detach tuple from inner plans of physical operator

2010-08-11 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1448:
---

Status: Patch Available  (was: Open)

 Detach tuple from inner plans of physical operator 
 ---

 Key: PIG-1448
 URL: https://issues.apache.org/jira/browse/PIG-1448
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.7.0, 0.6.0, 0.5.0, 0.4.0, 0.3.0, 0.2.0, 0.1.0
Reporter: Ashutosh Chauhan
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: multi_oom_filt.pig, PIG-1448.1.patch





[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Status: Patch Available  (was: Open)

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch


 Here is an example:
 Data input:
 {code}
 1   1
 2
 {code}
 the script 
 {code}
 a = load 'input';
 b = load 'input';
 c = join a by $0, b by $0 using 'repl';
 dump c; 
 {code}
 generates results that match null values:
 {code}
 (1,1,1,1)
 (,2,,2)
 {code}
 The regular join, on the other hand, gives the correct results.
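
The expected behaviour can be illustrated with a small in-memory hash join that skips null keys on both sides. This is a sketch only, not the attached patch or the FR join operator; the class NullSafeHashJoin and its join method are hypothetical, and rows are modelled as String arrays joined on column 0.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory join; not the actual Pig FR join code. */
public class NullSafeHashJoin {
    /** Joins left and right rows on column 0; a null key never matches. */
    public static List<String[]> join(List<String[]> left, List<String[]> right) {
        // Build phase: index the replicated (right) input by key,
        // leaving out rows whose key is null.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] r : right) {
            if (r[0] == null) continue;
            index.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r);
        }
        // Probe phase: rows with a null key on the left are skipped too.
        List<String[]> out = new ArrayList<>();
        for (String[] l : left) {
            if (l[0] == null) continue;
            List<String[]> matches = index.get(l[0]);
            if (matches == null) continue;
            for (String[] r : matches) {
                String[] joined = new String[l.length + r.length];
                System.arraycopy(l, 0, joined, 0, l.length);
                System.arraycopy(r, 0, joined, l.length, r.length);
                out.add(joined);
            }
        }
        return out;
    }
}
```

For the two-row input above joined with itself, this returns only the row built from key 1; the row with the null key produces no output, which is the behaviour of the regular join.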




[jira] Updated: (PIG-1541) FR Join shouldn't match null values

2010-08-11 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1541:
--

Attachment: PIG-1541.patch

 FR Join shouldn't match null values
 ---

 Key: PIG-1541
 URL: https://issues.apache.org/jira/browse/PIG-1541
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1541.patch

