[jira] Subscription: PIG patch available

2019-01-22 Thread jira
Issue Subscription
Filter: PIG patch available (37 issues)

Subscriber: pigdaily

Key         Summary
PIG-5377    Move supportsParallelWriteToStoreLocation from StoreFunc to StoreFuncInterfce
            https://issues.apache.org/jira/browse/PIG-5377
PIG-5369    Add llap-client dependency
            https://issues.apache.org/jira/browse/PIG-5369
PIG-5360    Pig sets working directory of input file systems causes exception thrown
            https://issues.apache.org/jira/browse/PIG-5360
PIG-5338    Prevent deep copy of DataBag into Jython List
            https://issues.apache.org/jira/browse/PIG-5338
PIG-5323    Implement LastInputStreamingOptimizer in Tez
            https://issues.apache.org/jira/browse/PIG-5323
PIG-5273    _SUCCESS file should be created at the end of the job
            https://issues.apache.org/jira/browse/PIG-5273
PIG-5267    Review of org.apache.pig.impl.io.BufferedPositionedInputStream
            https://issues.apache.org/jira/browse/PIG-5267
PIG-5256    Bytecode generation for POFilter and POForeach
            https://issues.apache.org/jira/browse/PIG-5256
PIG-5160    SchemaTupleFrontend.java is not thread safe, cause PigServer thrown NPE in multithread env
            https://issues.apache.org/jira/browse/PIG-5160
PIG-5115    Builtin AvroStorage generates incorrect avro schema when the same pig field name appears in the alias
            https://issues.apache.org/jira/browse/PIG-5115
PIG-5106    Optimize when mapreduce.input.fileinputformat.input.dir.recursive set to true
            https://issues.apache.org/jira/browse/PIG-5106
PIG-5081    Can not run pig on spark source code distribution
            https://issues.apache.org/jira/browse/PIG-5081
PIG-5080    Support store alias as spark table
            https://issues.apache.org/jira/browse/PIG-5080
PIG-5057    IndexOutOfBoundsException when pig reducer processOnePackageOutput
            https://issues.apache.org/jira/browse/PIG-5057
PIG-5029    Optimize sort case when data is skewed
            https://issues.apache.org/jira/browse/PIG-5029
PIG-4926    Modify the content of start.xml for spark mode
            https://issues.apache.org/jira/browse/PIG-4926
PIG-4913    Reduce jython function initiation during compilation
            https://issues.apache.org/jira/browse/PIG-4913
PIG-4849    pig on tez will cause tez-ui to crash,because the content from timeline server is too long.
            https://issues.apache.org/jira/browse/PIG-4849
PIG-4750    REPLACE_MULTI should compile Pattern once and reuse it
            https://issues.apache.org/jira/browse/PIG-4750
PIG-4684    Exception should be changed to warning when job diagnostics cannot be fetched
            https://issues.apache.org/jira/browse/PIG-4684
PIG-4656    Improve String serialization and comparator performance in BinInterSedes
            https://issues.apache.org/jira/browse/PIG-4656
PIG-4598    Allow user defined plan optimizer rules
            https://issues.apache.org/jira/browse/PIG-4598
PIG-4551    Partition filter is not pushed down in case of SPLIT
            https://issues.apache.org/jira/browse/PIG-4551
PIG-4539    New PigUnit
            https://issues.apache.org/jira/browse/PIG-4539
PIG-4515    org.apache.pig.builtin.Distinct throws ClassCastException
            https://issues.apache.org/jira/browse/PIG-4515
PIG-4373    Implement PIG-3861 in Tez
            https://issues.apache.org/jira/browse/PIG-4373
PIG-4323    PackageConverter hanging in Spark
            https://issues.apache.org/jira/browse/PIG-4323
PIG-4313    StackOverflowError in LIMIT operation on Spark
            https://issues.apache.org/jira/browse/PIG-4313
PIG-4251    Pig on Storm
            https://issues.apache.org/jira/browse/PIG-4251
PIG-4002    Disable combiner when map-side aggregation is used
            https://issues.apache.org/jira/browse/PIG-4002
PIG-3952    PigStorage accepts '-tagSplit' to return full split information
            https://issues.apache.org/jira/browse/PIG-3952
PIG-3911    Define unique fields with @OutputSchema
            https://issues.apache.org/jira/browse/PIG-3911
PIG-3877    Getting Geo Latitude/Longitude from Address Lines
            https://issues.apache.org/jira/browse/PIG-3877
PIG-3873    Geo distance calculation using Haversine
            https://issues.apache.org/jira/browse/PIG-3873
PIG-3668    COR built-in function when atleast one of the coefficient values is NaN
            https://issues.apache.org/jira/browse/PIG-3668
PIG-3587    add functionality for rolling over dates
            https://issues.apache.org/jira/browse/PIG-3587
PIG-1804    Alow Jython function to implement Algebraic and/or Accumulator interfaces
            https://issues.apache.org/jira/browse/PIG-1804


[jira] [Updated] (PIG-5378) Optimize DISTINCT COUNT inside foreach

2019-01-22 Thread Rohini Palaniswamy (JIRA)


 [ https://issues.apache.org/jira/browse/PIG-5378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Palaniswamy updated PIG-5378:

Description: 
When a script does a DISTINCT followed by COUNT inside a foreach, the combiner 
is usually applied. In many of our scripts we have seen the DISTINCT bag grow 
to tens of thousands or millions of items, which makes hash aggregation 
perform much worse. Even if hash aggregation is turned off, the combiner still 
aggregates, and the reducer spills far too much because of the big bag.

This can be avoided if we apply secondary sort with ordering and make it use 
POSortedDistinct. Plain PODistinct is still not good enough, as it needs to 
hold all the elements in a HashSet. POSortedDistinct needs essentially no extra 
memory, because its input is already sorted.
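
To illustrate the memory difference, here is a minimal Python sketch. It is not 
Pig's operator code: count_distinct_hashed and count_distinct_sorted are 
hypothetical names. The first mimics a HashSet-based distinct, the second 
assumes its input is already sorted and only remembers the previous element.

```python
def count_distinct_hashed(items):
    """Hash-based distinct count: the set grows with the number of distinct items."""
    seen = set()
    for item in items:
        seen.add(item)
    return len(seen)


def count_distinct_sorted(sorted_items):
    """Distinct count over already-sorted input: only the previous item is kept."""
    count = 0
    previous = object()  # sentinel that compares unequal to every real item
    for item in sorted_items:
        if item != previous:
            count += 1
            previous = item
    return count


values = [3, 1, 3, 2, 1, 1]
print(count_distinct_hashed(values))          # 3
print(count_distinct_sorted(sorted(values)))  # 3
```

In this sketch the sort is done in memory by sorted(); in the proposal the 
ordering would come from the secondary sort, so the reducer side never has to 
hold the whole bag.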

Two things need to be done:
1) If we see a distinct count, turn it into a POSortedDistinct using 
SecondaryKeyOptimizer. Currently CombinerOptimizer runs first, so we need to 
turn off the combiner optimization for distinct. This can be made configurable 
with pig.optimize.nested.distinct = true, which we would keep as the default in 
our clusters.
2) SecondaryKeyOptimizer does not convert it into POSortedDistinct in the case 
below, because there is a POForEach in the plan before the PODistinct 
(https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java#L529-L533).

{code}
B = GROUP A BY f1;
C = FOREACH B {
    sorted = ORDER A by f2;
    unique = DISTINCT sorted.f2;
    GENERATE group, COUNT(unique) as cnt;
}
{code}

This script does not generate POSortedDistinct, which has to be fixed. We 
worked around it by projecting f2 before the sort:

{code}
B = GROUP A BY f1;
C = FOREACH B {
    fields = A.f2;
    sorted = ORDER fields by f2;
    unique = DISTINCT sorted;
    GENERATE group, COUNT(unique) as cnt;
}
{code}
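
What both scripts compute can be sketched in Python as follows. This is only an 
illustration under stated assumptions: distinct_counts is a hypothetical helper, 
records are (f1, f2) pairs, and the in-memory sorted() call stands in for the 
shuffle with a secondary sort on f2.

```python
from itertools import groupby
from operator import itemgetter


def distinct_counts(records):
    """Yield (f1, number of distinct f2 values) from (f1, f2) pairs."""
    # Stand-in for the shuffle with secondary sort: order by f1, then by f2.
    ordered = sorted(records, key=itemgetter(0, 1))
    for f1, group in groupby(ordered, key=itemgetter(0)):
        # Within one group f2 is sorted, so each distinct value forms one run.
        cnt = sum(1 for _ in groupby(group, key=itemgetter(1)))
        yield f1, cnt


A = [("a", 2), ("b", 5), ("a", 1), ("a", 2), ("b", 5)]
print(list(distinct_counts(A)))  # [('a', 2), ('b', 1)]
```

Because each group arrives with f2 already ordered, counting runs is enough and 
no per-group set of seen values is needed.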




  was:
When a script does a DISTINCT followed by COUNT inside a foreach, the combiner 
is usually applied. In many of our scripts we have seen the DISTINCT bag grow 
to tens of thousands or millions of items, which makes hash aggregation 
perform much worse. Even if hash aggregation is turned off, the combiner still 
aggregates, and the reducer spills far too much because of the big bag.

This can be avoided if we apply secondary sort with ordering and make it use 
POSortedDistinct. Plain PODistinct is still not good enough, as it needs to 
hold all the elements in a HashSet. POSortedDistinct needs essentially no extra 
memory, because its input is already sorted.

Two things need to be done:
1) If we see a distinct count, turn it into a POSortedDistinct using 
SecondaryKeyOptimizer. Currently CombinerOptimizer runs first, so we need to 
turn off the combiner optimization for distinct. This can be made configurable 
with pig.optimize.nested.distinct = true, which we would keep as the default in 
our clusters.
2) SecondaryKeyOptimizer does not convert it into POSortedDistinct in the case 
below, because there is a POForEach in the plan before the PODistinct 
(https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java#L529-L533).

{code}
B = GROUP A BY f1;
C = FOREACH B {
    sorted = ORDER A by f2;
    unique = DISTINCT sorted.f2;
    GENERATE group, COUNT(unique) as cnt;
}
{code}

This script does not generate POSortedDistinct, which has to be fixed. We 
worked around it by projecting f2 before the sort:

{code}
B = GROUP A BY f1;
C = FOREACH B {
    fields = A.f2;
    sorted = ORDER fields by f2;
    unique = DISTINCT sorted;
    GENERATE group, COUNT(unique) as cnt;
}





> Optimize DISTINCT COUNT inside foreach
> --
>
> Key: PIG-5378
> URL: https://issues.apache.org/jira/browse/PIG-5378
> Project: Pig
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
> Priority: Major

[jira] [Created] (PIG-5378) Optimize DISTINCT COUNT inside foreach

2019-01-22 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-5378:
---

     Summary: Optimize DISTINCT COUNT inside foreach
         Key: PIG-5378
         URL: https://issues.apache.org/jira/browse/PIG-5378
     Project: Pig
  Issue Type: Improvement
    Reporter: Rohini Palaniswamy


When a script does a DISTINCT followed by COUNT inside a foreach, the combiner 
is usually applied. In many of our scripts we have seen the DISTINCT bag grow 
to tens of thousands or millions of items, which makes hash aggregation 
perform much worse. Even if hash aggregation is turned off, the combiner still 
aggregates, and the reducer spills far too much because of the big bag.

This can be avoided if we apply secondary sort with ordering and make it use 
POSortedDistinct. Plain PODistinct is still not good enough, as it needs to 
hold all the elements in a HashSet. POSortedDistinct needs essentially no extra 
memory, because its input is already sorted.

Two things need to be done:
1) If we see a distinct count, turn it into a POSortedDistinct using 
SecondaryKeyOptimizer. Currently CombinerOptimizer runs first, so we need to 
turn off the combiner optimization for distinct. This can be made configurable 
with pig.optimize.nested.distinct = true, which we would keep as the default in 
our clusters.
2) SecondaryKeyOptimizer does not convert it into POSortedDistinct in the case 
below, because there is a POForEach in the plan before the PODistinct 
(https://github.com/apache/pig/blob/branch-0.17/src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java#L529-L533).

{code}
B = GROUP A BY f1;
C = FOREACH B {
    sorted = ORDER A by f2;
    unique = DISTINCT sorted.f2;
    GENERATE group, COUNT(unique) as cnt;
}
{code}

This script does not generate POSortedDistinct, which has to be fixed. We 
worked around it by projecting f2 before the sort:

{code}
B = GROUP A BY f1;
C = FOREACH B {
    fields = A.f2;
    sorted = ORDER fields by f2;
    unique = DISTINCT sorted;
    GENERATE group, COUNT(unique) as cnt;
}





