[jira] Resolved: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning
[ https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy resolved PIG-916. --- Fix Version/s: 0.8.0 Resolution: Duplicate Fixed in PIG-1205 Change the pig hbase interface to get more than one row at a time when scanning --- Key: PIG-916 URL: https://issues.apache.org/jira/browse/PIG-916 Project: Pig Issue Type: Improvement Reporter: Alex Newman Assignee: Dmitriy V. Ryaboy Priority: Trivial Fix For: 0.8.0 It should be significantly faster to get numerous rows at the same time rather than one row at a time for large table extraction processes. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1611) use enums for error code
[ https://issues.apache.org/jira/browse/PIG-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12909493#action_12909493 ] Dmitriy V. Ryaboy commented on PIG-1611: +140 use enums for error code Key: PIG-1611 URL: https://issues.apache.org/jira/browse/PIG-1611 Project: Pig Issue Type: Sub-task Reporter: Thejas M Nair Fix For: 0.9.0 Pig code is using integer constants for error codes, and the value of each error code is reserved using http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification . This process is cumbersome and error-prone. It would be better to use enum values instead. The enum value can contain the error message and encapsulate the error code. For example - {code} Replace throw new SchemaMergeException("Error in merging schema", 2124, PigException.BUG); with throw new SchemaMergeException(SCHEMA_MERGE_EX, PigException.BUG); {code} Where SCHEMA_MERGE_EX belongs to an error codes enum. We can use the ordinal value of the enum and an offset to determine the error code. The error message will be passed through the constructor of the enum. {code} SCHEMA_MERGE_EX("Error in merging schema"); {code} For documentation, the error codes and error messages can be dumped using code that uses the enum error code class. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
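For illustration only (this is not code from the issue or from Pig): a minimal Java sketch of the enum idea described above, where each constant carries its message and the numeric code is derived from the ordinal plus an offset. The names ErrorCodeSketch, PigError, TYPE_MISMATCH_EX, CodedException and the offset 2000 are hypothetical.

```java
final class ErrorCodeSketch {
    /** Hypothetical error-code enum; constants and messages are illustrative. */
    enum PigError {
        SCHEMA_MERGE_EX("Error in merging schema"),
        TYPE_MISMATCH_EX("Type mismatch in expression");

        // Assumed base value; a real project would pick its own reserved range.
        private static final int OFFSET = 2000;

        private final String message;

        PigError(String message) {
            this.message = message;
        }

        /** Numeric code derived from the declaration order plus the offset. */
        int code() {
            return OFFSET + ordinal();
        }

        String message() {
            return message;
        }
    }

    /** Hypothetical exception that takes the enum instead of a message and an int. */
    static class CodedException extends Exception {
        final PigError error;

        CodedException(PigError error) {
            super("ERROR " + error.code() + ": " + error.message());
            this.error = error;
        }
    }

    /** Dumps one "code<TAB>message" line per constant, e.g. for documentation. */
    static String dumpCatalog() {
        StringBuilder sb = new StringBuilder();
        for (PigError e : PigError.values()) {
            sb.append(e.code()).append('\t').append(e.message()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dumpCatalog());
    }
}
```

One caveat with ordinal-based codes: inserting or reordering constants silently renumbers every later code, so new constants would have to be appended at the end to keep published codes stable.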
[jira] Commented: (PIG-1602) The .classpath of eclipse template still use hbase-0.20.0
[ https://issues.apache.org/jira/browse/PIG-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906451#action_12906451 ] Dmitriy V. Ryaboy commented on PIG-1602: +1 The .classpath of eclipse template still use hbase-0.20.0 - Key: PIG-1602 URL: https://issues.apache.org/jira/browse/PIG-1602 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Priority: Minor Fix For: 0.8.0 Attachments: PIG_1602.patch The .classpath of eclipse template still use hbase-0.20.0, it should be updated to hbase-0.20.6 -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1597) Development snapshot jar no longer picked up by bin/pig
[ https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1597: --- Status: Resolved (was: Patch Available) Fix Version/s: 0.9.0 Resolution: Fixed Committed. Development snapshot jar no longer picked up by bin/pig --- Key: PIG-1597 URL: https://issues.apache.org/jira/browse/PIG-1597 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0, 0.9.0 Attachments: PIG_1597.patch As George Stathis pointed out in PIG-1596, bin/pig no longer picks up development pig jars. This appears to have been introduced in PIG-1334, as the jar was renamed from -dev- to -SNAPSHOT-. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values
[ https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12906305#action_12906305 ] Dmitriy V. Ryaboy commented on PIG-1596: +1. Before committing, do you mind combining the new test with one of the existing ones? Trying to keep the test suite at under 24 hours :) NPE's thrown when attempting to load hbase columns containing null values - Key: PIG-1596 URL: https://issues.apache.org/jira/browse/PIG-1596 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.7.0 Reporter: George P. Stathis Fix For: 0.8.0, 0.9.0 Attachments: null_hbase_records.patch, PIG_1596.patch, PIG_1596_2.patch I'm not a committer, but I'd like to suggest the attached patch to handle loading hbase rows containing null cell values (since hbase is all about sparsely populated data rows). As it stands, a DataByteArray can be created with a null mData if a cell has no value, which causes NPEs when simply attempting to load a row containing the null cell in question. PS: the attached patch also contains a slight change to the bin/pig executable to point to the build/pig-*-SNAPSHOT.jar and not the build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a separate patch for this, I'll be happy to submit it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values
[ https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905818#action_12905818 ] Dmitriy V. Ryaboy commented on PIG-1596: I will review on Friday. NPE's thrown when attempting to load hbase columns containing null values - Key: PIG-1596 URL: https://issues.apache.org/jira/browse/PIG-1596 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.7.0 Reporter: George P. Stathis Fix For: 0.8.0, 0.9.0 Attachments: null_hbase_records.patch, PIG_1596.patch, PIG_1596_2.patch I'm not a committer, but I'd like to suggest the attached patch to handle loading hbase rows containing null cell values (since hbase is all about sparsely populated data rows). As it stands, a DataByteArray can be created with a null mData if a cell has no value, which causes NPEs when simply attempting to load a row containing the null cell in question. PS: the attached patch also contains a slight change to the bin/pig executable to point to the build/pig-*-SNAPSHOT.jar and not the build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a separate patch for this, I'll be happy to submit it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1597) Development snapshot jar no longer picked up by bin/pig
[ https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905819#action_12905819 ] Dmitriy V. Ryaboy commented on PIG-1597: Does anyone object to me just committing this to 0.8 and trunk? Development snapshot jar no longer picked up by bin/pig --- Key: PIG-1597 URL: https://issues.apache.org/jira/browse/PIG-1597 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1597.patch As George Stathis pointed out in PIG-1596, bin/pig no longer picks up development pig jars. This appears to have been introduced in PIG-1334, as the jar was renamed from -dev- to -SNAPSHOT-. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905612#action_12905612 ] Dmitriy V. Ryaboy commented on PIG-794: --- Doug and Scott will know better of course, but afaik, Avro doesn't support Object keys. You can cheat and turn Object keys into strings by Base64-encoding their serialized representations.. you'd have to know to reverse the process when deserializing, though. Or we can try to get rid of InternalMap. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
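For illustration only: a minimal Java sketch of the workaround mentioned in the comment above, turning an arbitrary object key into a string by Base64-encoding its serialized bytes and reversing the process on read. It uses plain java.io serialization as a stand-in for whatever serialization Pig keys would really use; the class and method names are hypothetical and nothing here is taken from the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Base64;

final class KeyCodecSketch {
    /** Serializes the key and returns the bytes as a Base64 string usable as a map key. */
    static String encodeKey(Serializable key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    /** Reverses encodeKey: Base64-decodes the string and deserializes the object. */
    static Object decodeKey(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String encoded = encodeKey(Integer.valueOf(42));
        System.out.println(encoded + " -> " + decodeKey(encoded));
    }
}
```

As the comment notes, the reader has to know that the string keys are encoded objects; nothing in the encoded form itself signals that.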
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905775#action_12905775 ] Dmitriy V. Ryaboy commented on PIG-794: --- Jeff, that's what I am saying -- since they are writables, we can turn them into strings and not need InternalMap at all. Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key
ORDER BY distribution is uneven when record size is correlated with order key - Key: PIG-1592 URL: https://issues.apache.org/jira/browse/PIG-1592 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Fix For: 0.9.0 The partitioner contributed in PIG-545 distributes the order key space between partitions so that each partition gets approximately the same number of keys, even when the keys have a non-uniform distribution over the key space. Unfortunately this still allows for severe partition imbalance when record size is correlated with the order key. By way of motivating example, consider this script which attempts to produce a list of genuses based on how many species each genus contains:
{code}
set default_parallel 60;
critters = load 'biodata' as (genus, species);
genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters) as num_species, critters;
ordered_genuses = order genus_counts by num_species desc;
-- the output path below is illustrative; the original script omitted it
store ordered_genuses into 'ordered_genuses';
{code}
The higher the value of num_species, the more species tuples will be contained in the critters bag, and the wider the row. This can cause a severe processing imbalance, as the reducer processing the records with the highest values of num_species will have the same number of *records* as the reducer processing the lowest values, but it will have far more actual *bytes* to work on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key
[ https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905220#action_12905220 ] Dmitriy V. Ryaboy commented on PIG-1592: One proposal is to simply change the default weighted range partitioner to take into account the record size. If record size is uniform, or uniformly distributed, or non-uniformly distributed but independent of the order key, this change shouldn't materially affect the distributions created for data sets not covered by this issue. ORDER BY distribution is uneven when record size is correlated with order key - Key: PIG-1592 URL: https://issues.apache.org/jira/browse/PIG-1592 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy Fix For: 0.9.0 The partitioner contributed in PIG-545 distributes the order key space between partitions so that each partition gets approximately the same number of keys, even when the keys have a non-uniform distribution over the key space. Unfortunately this still allows for severe partition imbalance when record size is correlated with the order key. By way of motivating example, consider this script which attempts to produce a list of genuses based on how many species each genus contains:
{code}
set default_parallel 60;
critters = load 'biodata' as (genus, species);
genus_counts = foreach (group critters by genus) generate group as genus, COUNT(critters) as num_species, critters;
ordered_genuses = order genus_counts by num_species desc;
-- the output path below is illustrative; the original script omitted it
store ordered_genuses into 'ordered_genuses';
{code}
The higher the value of num_species, the more species tuples will be contained in the critters bag, and the wider the row. This can cause a severe processing imbalance, as the reducer processing the records with the highest values of num_species will have the same number of *records* as the reducer processing the lowest values, but it will have far more actual *bytes* to work on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
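For illustration only: a minimal Java sketch of the proposal in the comment above, choosing range-partition boundaries over a sorted sample by cumulative weight rather than by record count. This is not Pig's partitioner; WeightedCutsSketch and cutPoints are hypothetical names, and the sample sizes are made up. Passing a weight of 1 per record reproduces count-based balancing, while passing record sizes in bytes balances the bytes per partition.

```java
import java.util.ArrayList;
import java.util.List;

final class WeightedCutsSketch {
    /**
     * weights[i] is the weight of the i-th sampled record, with samples already
     * sorted by the order key. Returns the sample indexes at which a new
     * partition starts, so that partitions carry roughly equal total weight.
     */
    static List<Integer> cutPoints(long[] weights, int partitions) {
        long total = 0;
        for (long w : weights) {
            total += w;
        }
        List<Integer> cuts = new ArrayList<>();
        long running = 0;
        int next = 1; // index of the next boundary to place (1 .. partitions - 1)
        // Stop before the last sample so the final partition is never empty.
        for (int i = 0; i < weights.length - 1 && next < partitions; i++) {
            running += weights[i];
            if (running * partitions >= total * next) {
                cuts.add(i + 1);
                // A single heavy record may cross several thresholds at once.
                while (next < partitions && running * partitions >= total * next) {
                    next++;
                }
            }
        }
        return cuts;
    }

    public static void main(String[] args) {
        long[] ones = {1, 1, 1, 1, 1, 1};   // one unit per record
        long[] bytes = {1, 1, 1, 1, 4, 8};  // made-up record sizes
        System.out.println("by count: " + cutPoints(ones, 2));
        System.out.println("by bytes: " + cutPoints(bytes, 2));
    }
}
```

With the made-up sizes above, count-based cuts split the six samples three and three, while byte-based cuts put the five small records in one partition and the single large record in the other, which is the kind of rebalancing the issue asks for.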
[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values
[ https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12905404#action_12905404 ] Dmitriy V. Ryaboy commented on PIG-1596: Jeff, I think it's clearer if you insert null into the tuple, not an empty DataByteArray (and assertNull in the test). George, the SNAPSHOT thing is a real bug, thanks for catching that. This happened when pig was made available through maven in PIG-1334. I'll create a separate ticket for that. NPE's thrown when attempting to load hbase columns containing null values - Key: PIG-1596 URL: https://issues.apache.org/jira/browse/PIG-1596 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.7.0 Reporter: George P. Stathis Fix For: 0.8.0, 0.9.0 Attachments: null_hbase_records.patch, PIG_1596.patch I'm not a committer, but I'd like to suggest the attached patch to handle loading hbase rows containing null cell values (since hbase is all about sparsely populated data rows). As it stands, a DataByteArray can be created with a null mData if a cell has no value, which causes NPEs when simply attempting to load a row containing the null cell in question. PS: the attached patch also contains a slight change to the bin/pig executable to point to the build/pig-*-SNAPSHOT.jar and not the build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a separate patch for this, I'll be happy to submit it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1597) Development snapshot jar no longer picked up by bin/pig
Development snapshot jar no longer picked up by bin/pig --- Key: PIG-1597 URL: https://issues.apache.org/jira/browse/PIG-1597 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 As George Stathis pointed out in PIG-1596, bin/pig no longer picks up development pig jars. This appears to have been introduced in PIG-1334, as the jar was renamed from -dev- to -SNAPSHOT-. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1597) Development snapshot jar no longer picked up by bin/pig
[ https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1597: --- Status: Patch Available (was: Open) Development snapshot jar no longer picked up by bin/pig --- Key: PIG-1597 URL: https://issues.apache.org/jira/browse/PIG-1597 Project: Pig Issue Type: Bug Components: grunt Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1597.patch As George Stathis pointed out in PIG-1596, bin/pig no longer picks up development pig jars. This appears to have been introduced in PIG-1334, as the jar was renamed from -dev- to -SNAPSHOT-. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-794) Use Avro serialization in Pig
[ https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904615#action_12904615 ] Dmitriy V. Ryaboy commented on PIG-794: --- Jeff, have you checked out Scott Carey's work here: https://issues.apache.org/jira/browse/AVRO-592 ? Use Avro serialization in Pig - Key: PIG-794 URL: https://issues.apache.org/jira/browse/PIG-794 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.2.0 Reporter: Rakesh Setty Assignee: Dmitriy V. Ryaboy Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, jackson-asl-0.9.4.jar, PIG-794.patch We would like to use Avro serialization in Pig to pass data between MR jobs instead of the current BinStorage. Attached is an implementation of AvroBinStorage which performs significantly better compared to BinStorage on our benchmarks. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Attachment: PIG_1205_9.patch Patch with the StoreCaster changes as suggested by Alan. With +1s from Alan and Jeff, committing. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12904325#action_12904325 ] Dmitriy V. Ryaboy commented on PIG-1205: Re HBASE-1933, they are publishing snapshots of current trunk, not the 0.20 branch. We'll be able to start using maven to pull down hbase when we upgrade to their 0.9 release (which iirc depends on hdfs appends...) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Status: Resolved (was: Patch Available) Release Note: HBaseStorage has been significantly reworked with this release. Usage:
{code}
my_data = LOAD 'hbase://table_name' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2', '-caching 100') as (col1:int, col2:chararray);
STORE my_data INTO 'hbase://other_table' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 colfamily:col2');
{code}
HBaseStorage can now write data into HBase as well as read it. The first argument is a space-delimited list of columns to be loaded (or stored). Columns are specified as columnfamily:column_name. The second argument is an optional set of key-value pairs used to control HBaseStorage behavior. Available arguments are:
* {{-loadKey}} Used to load the row key; false by default. If true, the first field in the returned tuple will be the value of the row key.
* {{-gt}}, {{-gte}}, {{-lt}}, and {{-lte}} Used to specify bounds on row keys to be scanned. The keys are specified as binary data, using the hex representation. Any slashes have to be double-escaped (two slashes per single real slash) to be parsed correctly.
* {{-caching}} Used to specify the number of rows to be cached per HBase RPC call. See http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#setScannerCaching%28int%29 for more information about this HBase feature.
* {{-limit}} Used to control how many rows *per scanned region* will be retrieved. This can speed up processing if you just want a few rows. The total number of rows returned will be up to (number of regions * limit). The limit is applied after any -gt, -lt, etc. filters. Pig's LIMIT operator can be used in conjunction with this argument.
* {{-caster}} Used to specify a LoadCaster (or LoadStoreCaster, for storage) used to convert the data stored in HBase into Pig data. By default, the Utf8StorageConverter is used, which stores all data as its string representation. The string HBaseBinaryConverter can be used to specify that data is stored in HBase's native binary format. Note that the HBaseBinaryConverter does not work with complex data types such as maps, tuples, and bags. You can also specify a fully qualified class name such as org.apache.pig.backend.hadoop.hbase.HBaseBinaryConverter to use your own Caster. The default caster can be changed by setting the pig.hbase.caster property in pig.properties.
HBaseStorage matches column arguments to tuple fields based on their ordinal position. When storing, the first field is expected to be the key value. Resolution: Fixed Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Attachment: hbase-0.20.6.jar hbase-0.20.6-test.jar Attaching the hbase-0.20.6 jars HBase is an apache project, so no license issues. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903634#action_12903634 ] Dmitriy V. Ryaboy commented on PIG-1150: I won't have time before the 30th. BTW one doesn't even need a udf if using the sum of squares approach.. :-) just generate the square and the sum in the foreach (it will perform the algebraic decomposition automatically) VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
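For illustration only: a minimal Java sketch of the count / sum / sum-of-squares approach described above, where each partial result can be merged with another before the final variance is computed. It is not the attached var.patch; VarianceSketch and Partial are hypothetical names, and the result is the population variance.

```java
final class VarianceSketch {
    /** Mergeable partial aggregate: count, sum, and sum of squares. */
    static final class Partial {
        final long count;
        final double sum;
        final double sumSq;

        Partial(long count, double sum, double sumSq) {
            this.count = count;
            this.sum = sum;
            this.sumSq = sumSq;
        }

        /** Builds a partial from raw values (the per-chunk, "initial" step). */
        static Partial of(double... values) {
            double s = 0;
            double sq = 0;
            for (double v : values) {
                s += v;
                sq += v * v;
            }
            return new Partial(values.length, s, sq);
        }

        /** Combines two partials (the "intermediate" step); order does not matter. */
        Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum, sumSq + other.sumSq);
        }

        /** Final step: population variance E[x^2] - (E[x])^2; requires count > 0. */
        double variance() {
            double mean = sum / count;
            return sumSq / count - mean * mean;
        }
    }

    public static void main(String[] args) {
        Partial a = Partial.of(2, 4, 4, 4);
        Partial b = Partial.of(5, 5, 7, 9);
        System.out.println(a.merge(b).variance());
    }
}
```

This formulation is simple to parallelize but can lose precision through cancellation when the mean is large relative to the spread, which is one reason the linked Wikipedia page also lists other algorithms.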
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903636#action_12903636 ] Dmitriy V. Ryaboy commented on PIG-1563: Sounds good. Should we just merge in the amazon contrib for some of these? SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903643#action_12903643 ] Dmitriy V. Ryaboy commented on PIG-1150: Yeah I think it's not a big deal if we are splitting piggybank out soon anyway. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903644#action_12903644 ] Dmitriy V. Ryaboy commented on PIG-1563: Olga, the amazon contrib is PIG-1565 SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1564) add support for multiple filesystems
[ https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903728#action_12903728 ] Dmitriy V. Ryaboy commented on PIG-1564: Andrew, does 'fs -cd s3://anhi-test-data/' work? The cd command is also deprecated (though not marked as such) :) add support for multiple filesystems Key: PIG-1564 URL: https://issues.apache.org/jira/browse/PIG-1564 Project: Pig Issue Type: Improvement Reporter: Andrew Hitchcock Attachments: PIG-1564-1.patch Currently you can't run Pig scripts that read data from one file system and write it to another. Also, Grunt doesn't support CDing from one directory to another on different file systems. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12903753#action_12903753 ] Dmitriy V. Ryaboy commented on PIG-1563: +1 question/comment -- any reason you discarded the new buildSimpleFuncSpec I wrote in the first iteration of this patch? I think it simplifies the code: {code} funcList.add(Utils.buildSimpleFuncSpec( this.getClass().getName(), DataType.CHARARRAY, DataType.CHARARRAY)); {code} vs {code} Schema s = new Schema(); s.add(new Schema.FieldSchema(null, DataType.CHARARRAY)); s.add(new Schema.FieldSchema(null, DataType.CHARARRAY)); funcList.add(new FuncSpec(this.getClass().getName(), s)); {code} SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch, PIG_1563_v2.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1555: --- Status: Resolved (was: Patch Available) Release Note: CSVLoader can be used to load comma-separated value files. It properly handles commas included inside quoted fields, and quotes escaped by preceding them with another quote character (Excel-style). CSVLoader only handles single-line entries; quoting a multi-line value will *not* work. Resolution: Fixed [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
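For illustration only: a minimal Java sketch of the parsing rules named in the release note above (commas inside quoted fields, and a doubled quote standing for one literal quote), applied to a single line. It is not the piggybank CSVLoader; CsvLineSketch and splitLine are hypothetical names, and like the loader described above it does not handle values that span lines.

```java
import java.util.ArrayList;
import java.util.List;

final class CsvLineSketch {
    /** Splits one CSV line into fields, honoring quoted commas and "" escapes. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas are ordinary text here
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("a,\"b,c\",\"say \"\"hi\"\"\",d"));
    }
}
```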
[jira] Commented: (PIG-1518) multi file input format for loaders
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903031#action_12903031 ] Dmitriy V. Ryaboy commented on PIG-1518: This is a great feature, thanks Yan. Could you comment on what the final solution was as far as PigStorage and OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but not what the ultimate direction you took was. multi file input format for loaders --- Key: PIG-1518 URL: https://issues.apache.org/jira/browse/PIG-1518 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Yan Zhou Fix For: 0.8.0 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch We frequently run into the situation where Pig needs to deal with small files in the input. In this case a separate map is created for each file, which could be very inefficient. It would be great to have an umbrella input format that can take multiple files and use them in a single split. We would like to see this working with different data formats if possible. There are already a couple of input formats doing a similar thing: MultifileInputFormat as well as CombinedInputFormat; however, neither works with the new Hadoop 20 API. We at least want to do a feasibility study for Pig 0.8.0.
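To make the idea in the issue description concrete (several small files sharing one split instead of one map per file), here is a minimal Java sketch of greedy packing by size. It is not the PIG-1518 patch: the class name SplitPackingSketch, the method packIntoSplits, and the maxSplitBytes threshold are hypothetical, and a real input format would presumably also weigh data locality.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: NOT the PIG-1518 implementation. */
class SplitPackingSketch {

    /**
     * Groups file sizes, in input order, into splits whose total size stays
     * at or below maxSplitBytes. A single file larger than the limit gets a
     * split of its own.
     */
    static List<List<Long>> packIntoSplits(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Five files become three splits with a 64-byte limit.
        System.out.println(packIntoSplits(List.of(10L, 20L, 30L, 100L, 5L), 64L).size());
    }
}
```

With one map per split, the number of map tasks in this toy example drops from five to three.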
[jira] Updated: (PIG-1563) SUBSTRING function is broken
[ https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1563: --- Status: Patch Available (was: Open) Affects Version/s: 0.8.0 SUBSTRING function is broken Key: PIG-1563 URL: https://issues.apache.org/jira/browse/PIG-1563 Project: Pig Issue Type: Bug Affects Versions: 0.8.0 Reporter: Olga Natkovich Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1563.patch Script: A = load 'studenttab10k' as (name, age, gpa); C = foreach A generate SUBSTRING(name, 0,5); E = limit C 10; dump E; Output is always empty: () () () () () () () () () () -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Status: Open (was: Patch Available) Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Attachment: PIG_1551.2.patch Attaching patch that fixes the two errors Richard pointed out. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Status: Patch Available (was: Open) Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901787#action_12901787 ] Dmitriy V. Ryaboy commented on PIG-1205: Jeff, Thanks a lot for pitching in with the tests! I was using 0.20.0 and the old tests passed. I've only tested the binary conversion stuff and other new features on the Twitter machines, and they do run a later HBase version -- perhaps the incompatibility is in the filters or binary casters code? Do you know which tests fail with 0.20.0? I will definitely add a bunch of documentation. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902008#action_12902008 ] Dmitriy V. Ryaboy commented on PIG-1205: Ok, let's upgrade to 20.6 then. We could work around this by serializing the filters ourselves and applying them to the scan when reading the UDFContext, but that seems a bit overboard, and folks should be upgrading anyway. *Committers*: this is ready for review. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Attachment: PIG_1551.3.patch Ugh. Thank you for catching that -- fixed, and added a test to make sure it stays fixed. The particular set of methods I needed this for used primitives, so that's what I did. It's a bit tricky to add support for Long, Double, etc. arrays, as I would have to check all combinations of possible method signatures when seeing things like (int[], int[], int[]) -- it becomes fairly ugly code. Do you think this is particularly compelling? I can't really think of methods that take arrays of Number classes; usually, if you start using Numbers, you are also using Collections, not plain arrays. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils .
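The point about primitive versus Number arrays comes from how Java reflection resolves methods: Class.getMethod needs the exact parameter types, so double[] and Double[] are different lookups. The sketch below illustrates this with a JDK method, java.util.Arrays.toString(double[]), chosen only because it is always available. The class ArrayArgSketch and its helpers are hypothetical and are not taken from the patch.

```java
import java.lang.reflect.Method;
import java.util.List;

/** Illustration only: NOT the PIG-1551 implementation. */
class ArrayArgSketch {

    /** Copies a list of numbers (standing in for a Pig bag) into a double[]. */
    static double[] toPrimitive(List<? extends Number> values) {
        double[] out = new double[values.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = values.get(i).doubleValue();
        }
        return out;
    }

    /** Looks up a static method taking exactly one double[] and invokes it. */
    static Object call(Class<?> owner, String name, double[] arg)
            throws ReflectiveOperationException {
        Method m = owner.getMethod(name, double[].class);
        // The cast makes it explicit that the array is one argument.
        return m.invoke(null, (Object) arg);
    }

    /** True only if a public method with exactly this parameter type exists. */
    static boolean hasExactMatch(Class<?> owner, String name, Class<?> paramType) {
        try {
            owner.getMethod(name, paramType);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        double[] xs = toPrimitive(List.of(1, 2.5));
        System.out.println(call(java.util.Arrays.class, "toString", xs));
        // Arrays has toString(double[]) but no toString(Double[]).
        System.out.println(hasExactMatch(java.util.Arrays.class, "toString", Double[].class));
    }
}
```

Supporting both forms would mean trying every primitive/boxed combination for each array parameter, which is the signature explosion the comment above refers to.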
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy V. Ryaboy updated PIG-1551:

Status: Resolved (was: Patch Available)
Release Note: The idea is simple: frequently, Pig users need to use a simple function that is already provided by standard Java libraries, but for which a UDF has not been written. Dynamic Invokers allow a Pig programmer to refer to Java functions without having to wrap them in custom Pig UDFs, at the cost of doing some Java reflection on every function call.

{code}
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
{code}

Currently, Dynamic Invokers can be used for any static function that accepts no arguments or some combination of Strings, ints, longs, doubles, floats, or arrays of same, and returns a String, an int, a long, a double, or a float. Numeric arguments must be primitives; methods that take the capital-letter numeric classes as arguments are not supported. Depending on the return type, a specific kind of Invoker must be used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or InvokeForFloat.

The DEFINE keyword is used to bind a keyword to a Java method, as above. The first argument to the InvokeFor* constructor is the full path to the desired method. The second argument is a space-delimited ordered list of the classes of the method arguments. This can be omitted or an empty string if the method takes no arguments. Valid class names are String, Long, Float, Double, and Int. Invokers can also work with array arguments, represented in Pig as DataBags of single-tuple elements. Simply refer to string[], for example. Class names are not case-sensitive.

The ability to use invokers on methods that take array arguments makes methods like those in org.apache.commons.math.stat.StatUtils available for processing the results of grouping your datasets, for example. This is very nice, but a word of caution: the resulting UDF will of course not be optimized for Hadoop, and the very significant benefits one gains from implementing the Algebraic and Accumulative interfaces are lost here. Be careful with this one.

Resolution: Fixed

Committed.

Improve dynamic invokers to deal with no-arg methods and array parameters
Key: PIG-1551
URL: https://issues.apache.org/jira/browse/PIG-1551
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Fix For: 0.8.0
Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch

PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils .
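For readers curious what "doing some Java reflection" amounts to, the following standalone Java sketch mimics the two constructor arguments described in the release note: a full method path and a space-delimited, case-insensitive list of argument types. It is only an approximation written for this note. The class InvokerSketch and its methods are hypothetical, array types are left out, and none of this is the actual InvokeFor* code.

```java
import java.lang.reflect.Method;
import java.util.Locale;

/** Illustration only: NOT Pig's InvokeForString and friends. */
class InvokerSketch {
    private final Method method;

    /**
     * @param fullName  for example "java.net.URLDecoder.decode"
     * @param paramSpec for example "String String"; empty or null for no-arg methods
     */
    InvokerSketch(String fullName, String paramSpec) throws ReflectiveOperationException {
        int dot = fullName.lastIndexOf('.');
        Class<?> owner = Class.forName(fullName.substring(0, dot));
        String methodName = fullName.substring(dot + 1);

        String trimmed = paramSpec == null ? "" : paramSpec.trim();
        String[] names = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Class<?>[] types = new Class<?>[names.length];
        for (int i = 0; i < names.length; i++) {
            types[i] = toClass(names[i]);
        }
        // Resolved once here; the reflective call itself still happens per invocation.
        this.method = owner.getMethod(methodName, types);
    }

    /** Maps a case-insensitive type name to String or a primitive class. */
    static Class<?> toClass(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "string": return String.class;
            case "int":    return int.class;
            case "long":   return long.class;
            case "float":  return float.class;
            case "double": return double.class;
            default: throw new IllegalArgumentException("Unsupported type: " + name);
        }
    }

    /** Invokes the static method; the receiver is null because it is static. */
    Object invoke(Object... args) throws ReflectiveOperationException {
        return method.invoke(null, args);
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        InvokerSketch decode = new InvokerSketch("java.net.URLDecoder.decode", "String String");
        System.out.println(decode.invoke("a%20b", "UTF-8")); // prints: a b
    }
}
```

The sketch maps numeric names to primitive classes only, which matches the restriction stated in the release note.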
[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1354: --- Release Note: Please see PIG-1551 release notes. UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods
[ https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901584#action_12901584 ] Dmitriy V. Ryaboy commented on PIG-1354: Olga, There is a follow-up ticket here: https://issues.apache.org/jira/browse/PIG-1551 If that gets committed, I have a pretty detailed explanation of how to use the stuff in http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/ (happy to put the link in release notes, or just paste the whole post). UDFs for dynamic invocation of simple Java methods -- Key: PIG-1354 URL: https://issues.apache.org/jira/browse/PIG-1354 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch The need to create wrapper UDFs for simple Java functions creates unnecessary work for Pig users, slows down the development process, and produces a lot of trivial classes. We can use Java's reflection to allow invoking a number of methods on the fly, dynamically, by creating a generic UDF to accomplish this. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12901697#action_12901697 ] Dmitriy V. Ryaboy commented on PIG-1555: Alan, The differences I observe when running on actual csv files are within the margin of error -- sometimes CSVLoader comes out on top. Then again I am reading actual CSVs with quoted commas, so it's possible that the similarity in runtimes is due to the fact that PigStorage sees the commas and allocates extra tuple fields. -D [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Attachment: PIG_1205_7.patch Implemented LoadPushDown (NOTE: this involved a slight backwards-compatible refactoring of Utf8StorageConverter). Refactored the tests a bit. At this point I think we are good except for further testing and documentation. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Status: Patch Available (was: Open) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1555) [piggybank] add CSV Loader
[piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1555: --- Attachment: PIG_1555.patch This is loosely based on the loader that James Kebinger open-sourced at http://github.com/jkebinger/pig-user-defined-functions . I ported it to the new API and fixed a few bugs. It still doesn't support multi-line records, but the basic stuff works, including escaping quotes by doubling them, Excel-style. [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done.
[jira] Updated: (PIG-1555) [piggybank] add CSV Loader
[ https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1555: --- Status: Patch Available (was: Open) [piggybank] add CSV Loader -- Key: PIG-1555 URL: https://issues.apache.org/jira/browse/PIG-1555 Project: Pig Issue Type: New Feature Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Priority: Minor Fix For: 0.8.0 Attachments: PIG_1555.patch Users often ask for a CSV loader that can handle quoted commas. Let's get 'er done. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6
[ https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901158#action_12901158 ] Dmitriy V. Ryaboy commented on PIG-1508: In http://comments.gmane.org/gmane.text.xml.forrest.user/4899 a Forrest committer says: "This validate sitemap task doesn't really do much anyway. Its main purpose is to demonstrate the power of using Jing to do xml validation during the build phase. There are other better demonstrations of that." Sounds like this is safe to do. +1 Make 'docs' target (forrest) work with Java 1.6 --- Key: PIG-1508 URL: https://issues.apache.org/jira/browse/PIG-1508 Project: Pig Issue Type: Bug Components: documentation Affects Versions: 0.7.0 Reporter: Carl Steinbach Assignee: Carl Steinbach Attachments: PIG-1508.patch.txt FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with Java 1.6. The same ticket also suggests a workaround: disabling sitemap and stylesheet validation by setting the forrest.validate.sitemap and forrest.validate.stylesheets properties to false.
[jira] Commented: (PIG-1237) Piggybank MutliStorage - specify field to write in output
[ https://issues.apache.org/jira/browse/PIG-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901161#action_12901161 ] Dmitriy V. Ryaboy commented on PIG-1237: Gerrit, Sorry this fell through the cracks! Just noticed this ticket. The ability to specify just one column seems very limited. Perhaps instead one could optionally specify whether to materialize the splitField? I think this would accomplish the same thing in a more general manner. Also, perhaps this warrants a second constructor, as introducing new arguments to the existing one will break backwards compatibility. Piggybank MutliStorage - specify field to write in output - Key: PIG-1237 URL: https://issues.apache.org/jira/browse/PIG-1237 Project: Pig Issue Type: Improvement Reporter: Gerrit Jansen van Vuuren Assignee: Gerrit Jansen van Vuuren Priority: Minor Attachments: PIG-1237.patch I've made a modification to the piggybank MultiStorage class that lets you optionally specify the index of the field in each tuple to write to the output. This makes it possible to have records with metadata such as seqno, time of upload, etc., and then combine files from these records into one without the metadata. E.g. 1: date type seq1 data 2: date type seq2 data then write output grouped by type and ordered by sequence: data data
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Status: Open (was: Patch Available) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Status: Patch Available (was: Open) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Attachment: PIG_1205_6.patch Fixed test (but did not add new tests). Made default caster configurable by setting pig.hbase.caster property. Made rowKey filters (gt, lt, gte, lte) filter out regions when possible. Tested manually. Jeff, to your comments about shifting to cut off regions -- I think it's better to have the loader think about region sizes, and let the user only worry about key values. If they are intimate enough with their tables to know region boundaries, they should know which end of a region is inclusive and which is exclusive, and provide the correct filters. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
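The remark about region boundaries can be sketched in a few lines of Java. The sketch assumes the usual HBase convention that a region covers keys from its start key (inclusive) to its end key (exclusive), with an empty key meaning "unbounded", and it compares keys as Java Strings for readability, whereas HBase compares raw bytes. The class RegionPruneSketch and the method canSkipRegion are hypothetical and are not taken from the patch.

```java
/** Illustration only: NOT the HBaseStorage code from PIG-1205. */
class RegionPruneSketch {

    /**
     * Decides whether a region [regionStart, regionEnd) can be skipped.
     *
     * @param lower          bound from a gt or gte filter, or null if absent
     * @param upper          bound from an lt or lte filter, or null if absent
     * @param upperInclusive true for lte, false for lt
     */
    static boolean canSkipRegion(String regionStart, String regionEnd,
                                 String lower, String upper, boolean upperInclusive) {
        // Every key in the region is strictly below regionEnd, so if
        // regionEnd <= lower no key can satisfy gt or gte. The two filters
        // prune identically here because the region end is exclusive.
        if (lower != null && !regionEnd.isEmpty() && regionEnd.compareTo(lower) <= 0) {
            return true;
        }
        // Every key in the region is at or above regionStart, which is an
        // inclusive boundary, so lt and lte differ when regionStart == upper.
        if (upper != null) {
            int cmp = regionStart.compareTo(upper);
            if (cmp > 0 || (cmp == 0 && !upperInclusive)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Region ["m", "t") with filter gte "t": nothing can match.
        System.out.println(canSkipRegion("m", "t", "t", null, false)); // prints: true
    }
}
```

Under these assumptions the user only supplies key bounds, and the loader decides which regions fall entirely outside them.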
[jira] Assigned: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-1551: -- Assignee: Dmitriy V. Ryaboy Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Reporter: Dmitriy V. Ryaboy PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Attachment: PIG-1551.patch Patch attached. Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters
[ https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1551: --- Status: Patch Available (was: Open) Affects Version/s: 0.8.0 Fix Version/s: 0.8.0 Improve dynamic invokers to deal with no-arg methods and array parameters - Key: PIG-1551 URL: https://issues.apache.org/jira/browse/PIG-1551 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1551.patch PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple Java methods in a UDF, so that users don't need to create trivial wrappers if they are ok sacrificing some speed. This issue is to extend the set of methods that can be wrapped this way to include methods that do not take any arguments, and methods that take arrays of {int,long,float,double,string} as arguments. Arrays are expected to be represented by bags in Pig. Notably, this allows users to wrap statistical functions in o.a.commons.math.stat.StatUtils . -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1420: --- Attachment: PIG-1420.2.patch This should fix the problem :). LMK if you'd like me to commit this. Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple - Key: PIG-1420 URL: https://issues.apache.org/jira/browse/PIG-1420 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.8.0 Attachments: addconcat2.patch, PIG-1420.2.patch Original Estimate: 24h Remaining Estimate: 24h org.apache.pig.builtin.CONCAT (which acts on DataByteArray's internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally), both act on the first two fields of a tuple. This results in ugly nested CONCAT calls like: CONCAT(CONCAT(A, ' '), B) The more desirable form is: CONCAT(A, ' ', B) This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. This seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
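The behavior requested in the issue description, concatenating every field rather than only the first two, is easy to show outside Pig. The Java sketch below is not the attached patch: the class ConcatSketch and the method concatAll are hypothetical, a plain List stands in for a Pig tuple, and returning null when any field is null is an assumption made for this illustration.

```java
import java.util.List;

/** Illustration only: NOT the PIG-1420 patch. */
class ConcatSketch {

    /**
     * Concatenates all fields in order. Returns null if any field is null
     * (an assumption for this sketch, not a statement about Pig's behavior).
     */
    static String concatAll(List<?> fields) {
        StringBuilder sb = new StringBuilder();
        for (Object field : fields) {
            if (field == null) {
                return null;
            }
            sb.append(field);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Equivalent in spirit to CONCAT(A, ' ', B) instead of CONCAT(CONCAT(A, ' '), B).
        System.out.println(concatAll(List.of("A", " ", "B"))); // prints: A B
    }
}
```

A loop over all fields also handles the old two-argument case unchanged, which is why the change is described as backwards compatible for callers that never passed extra fields.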
[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899587#action_12899587 ] Dmitriy V. Ryaboy commented on PIG-1420: Right.. i forgot people don't call StringConcat directly. I don't know how one specifies a vararg schema. Hints? Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple - Key: PIG-1420 URL: https://issues.apache.org/jira/browse/PIG-1420 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.8.0 Attachments: addconcat2.patch, PIG-1420.2.patch Original Estimate: 24h Remaining Estimate: 24h org.apache.pig.builtin.CONCAT (which acts on DataByteArray's internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally), both act on the first two fields of a tuple. This results in ugly nested CONCAT calls like: CONCAT(CONCAT(A, ' '), B) The more desirable form is: CONCAT(A, ' ', B) This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. This seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple
[ https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899597#action_12899597 ] Dmitriy V. Ryaboy commented on PIG-1420: Yeah, let's plan to add a way to specify a vararg in the schema in 0.9. In the meantime, what do we do with concat? Option 1: leave broken (only works for 2 arguments). Option 2: take out arg2func mapping, and have people who want to concat strings use StringConcat explicitly. Actually, there is an option 3, which makes more sense than option 2: make CONCAT actually do what StringConcat does, and introduce BinConcat (since it seems unlikely people are actually concatting bytearrays...). Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple - Key: PIG-1420 URL: https://issues.apache.org/jira/browse/PIG-1420 Project: Pig Issue Type: Improvement Components: impl Affects Versions: 0.8.0 Reporter: Russell Jurney Assignee: Russell Jurney Fix For: 0.8.0 Attachments: addconcat2.patch, PIG-1420.2.patch Original Estimate: 24h Remaining Estimate: 24h org.apache.pig.builtin.CONCAT (which acts on DataByteArray's internally) and org.apache.pig.builtin.StringConcat (which acts on Strings internally), both act on the first two fields of a tuple. This results in ugly nested CONCAT calls like: CONCAT(CONCAT(A, ' '), B) The more desirable form is: CONCAT(A, ' ', B) This change will be backwards compatible, provided that no one was relying on the fact that CONCAT ignores fields after the first two in a tuple. This seems a reasonable assumption to make, or at least a small break in compatibility for a sizable improvement. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899722#action_12899722 ] Dmitriy V. Ryaboy commented on PIG-1205: bq. 1. Is it possible to specify min_row_key and max_row_key in parameters Even better than that -- you can specify lt, lte, gt, and gte. It's true that, as written, splits will be created for the whole table, but the filters will cause most of those splits to exit immediately. Not creating the splits is on my todo list (I already do this in the elephantbird version for 0.6). bq. 2. One small suggestion: move line 206 to if block (only one time setting is enough) Good idea. bq. 3. It's better to add a warning log in HBaseBinaryConverter when the bytes are cut off for type conversion Will do. bq. 4. The parameter Per-region limit is a bit confusing to me, I think users would like to set the limit on the whole table, not per region. What do you think? Trouble is, you can't enforce a total limit without post-processing. In practice, I use -limit when I am experimenting and want to get just a few rows from HBase; if I want a specific number of rows, I use both -limit (to speed up the tasks, since the scanners will exit early) and Pig's LIMIT operator (to get the exact number of rows I need). Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1205: --- Attachment: PIG_1205_5.path This patch (not really review-ready yet) introduces the Elephant-Bird improvements. You can use -gt, -gte, -lt, -lte flags to filter out row ranges, specify caching and per-region row limits, and you can specify the caster to use (interpret Strings, as before, or use bytes directly for more efficient storage and communication). The filtering is a bit off because it still spins up all the map tasks; the ones whose keys are filtered out just finish extremely fast. The progress reporting is a bit jittery, but better than nothing. TODO: fix up filtering, add projection pushdown, add filter pushdown, and write better tests. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce
[ https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892375#action_12892375 ] Dmitriy V. Ryaboy commented on PIG-1516: Another workaround for the meantime: One can introduce a SmallBagFactory that inherits from BagFactory and produces SmallBags, which implement DataBag without a finalize method and without the file-spilling behavior. SmallBagFactory would return SmallBags when bagFactory.newDefaultBag() is called. Then, provide the system properties pig.data.bag.factory.name and pig.data.bag.factory.jar in pig.properties to point to the new classes. Naturally, one has to be certain that databags won't need to spill to disk when doing this... Ankur -- so what are you suggesting as a fix that avoids finalize? finalize in bag implementations causes pig to run out of memory in reduce -- Key: PIG-1516 URL: https://issues.apache.org/jira/browse/PIG-1516 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 *Problem:* pig bag implementations that are subclasses of DefaultAbstractBag have finalize methods implemented. As a result, the garbage collector moves them to a finalization queue, and the memory used is freed only after the finalization happens on it. If the bags are not finalized fast enough, a lot of memory is consumed by the finalization queue, and pig runs out of memory. This can happen if a large number of small bags are being created. *Solution:* The finalize function exists for the purpose of deleting the spill files that are created when the bag is too large. But if the bags are small enough, no spill files are created, and there is no use of the finalize function. A new class that holds a list of files will be introduced (FileList). This class will have a finalize method that deletes the files. The bags will no longer have finalize methods, and the bags will use FileList instead of ArrayList<File>. 
*Possible workaround for earlier releases:* Since the fix is going into 0.8, here is a workaround - Disabling the combiner will reduce the number of bags getting created, as there will not be the stage of combining intermediate merge results. But I would recommend disabling it only if you have this problem as it is likely to slow down the query . To disable combiner, set the property: -Dpig.exec.nocombiner=true -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
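The FileList idea from the *Solution* paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the committed fix: the class names are hypothetical, the bag creates its FileList lazily on the first spill (an assumption; the issue only says bags will use FileList), and the actual writing of tuples to the spill file is omitted. The point is that a bag that never spills never allocates an object with a finalizer.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the classes from the PIG-1516 fix. */
final class SpillFileListSketch {

    /** The only class with a finalizer; it owns the spill files. */
    static final class FileList {
        private final List<File> files = new ArrayList<>();

        void add(File f) {
            files.add(f);
        }

        void deleteAll() {
            for (File f : files) {
                f.delete();
            }
            files.clear();
        }

        @Override
        @SuppressWarnings({"deprecation", "removal"})
        protected void finalize() {
            deleteAll(); // safety net if the owner never cleaned up explicitly
        }
    }

    /** A bag with no finalize method of its own. */
    static final class SketchBag {
        private final List<Object> contents = new ArrayList<>();
        private FileList spillFiles; // stays null for bags that never spill (assumption)

        void add(Object tuple) {
            contents.add(tuple);
        }

        boolean hasSpilled() {
            return spillFiles != null;
        }

        /** Creates a spill file; writing the tuples into it is omitted in this sketch. */
        File spill() throws IOException {
            if (spillFiles == null) {
                spillFiles = new FileList();
            }
            File f = File.createTempFile("bagspill", ".tmp");
            spillFiles.add(f);
            contents.clear();
            return f;
        }

        void clear() {
            if (spillFiles != null) {
                spillFiles.deleteAll();
            }
            contents.clear();
        }
    }

    /** Returns true if a small bag allocates no FileList and a spilled bag cleans up its file. */
    static boolean demo() {
        try {
            SketchBag small = new SketchBag();
            small.add("t1");
            SketchBag big = new SketchBag();
            big.add("t2");
            File spilled = big.spill();
            boolean existedBefore = spilled.exists();
            big.clear();
            return !small.hasSpilled() && big.hasSpilled() && existedBefore && !spilled.exists();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] argv) {
        System.out.println("demo passed: " + demo());
    }
}
```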
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892376#action_12892376 ] Dmitriy V. Ryaboy commented on PIG-1205: I can integrate my changes by then. Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1150) VAR() Variance UDF
[ https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891863#action_12891863 ] Dmitriy V. Ryaboy commented on PIG-1150: Meh. Go ahead and commit. Don't put it into builtin, since it has math problems at scale. Ok for piggybank. VAR() Variance UDF -- Key: PIG-1150 URL: https://issues.apache.org/jira/browse/PIG-1150 Project: Pig Issue Type: New Feature Affects Versions: 0.5.0 Environment: UDF, written in Pig 0.5 contrib/ Reporter: Russell Jurney Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: var.patch I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates variance in a distributed manner, based on the AVG() builtin. It works by calculating the count, sum and sum of squares, as described here: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm Is this a worthwhile contribution? Taking the square root of this value using the contrib SQRT() function gives Standard Deviation, which is missing from Pig. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
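The count / sum / sum-of-squares approach described in the issue can be sketched like this. It is an assumption-laden illustration, not the UDF from var.patch: VariancePartialSketch is a hypothetical name, it computes the population variance, and partial results are merged by simple addition, which is what makes the calculation algebraic. Because the final step subtracts two potentially large, nearly equal numbers, this formula can lose precision on large inputs, which may be what the comment about math problems at scale refers to.

```java
/** Hypothetical illustration only; not the VAR() UDF from the attached patch. */
final class VariancePartialSketch {
    final long count;
    final double sum;
    final double sumSq;

    VariancePartialSketch(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    /** Builds a partial result from one chunk of the data (the "initial" step). */
    static VariancePartialSketch of(double... values) {
        double s = 0.0;
        double sq = 0.0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new VariancePartialSketch(values.length, s, sq);
    }

    /** Combines two partial results (the "intermediate" step). */
    VariancePartialSketch merge(VariancePartialSketch other) {
        return new VariancePartialSketch(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    /** Population variance: E[x^2] - (E[x])^2 (the "final" step). */
    double populationVariance() {
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }

    public static void main(String[] argv) {
        VariancePartialSketch merged = of(1.0, 2.0).merge(of(3.0, 4.0));
        System.out.println(merged.populationVariance());
        System.out.println(Math.sqrt(merged.populationVariance())); // standard deviation
    }
}
```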
[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
[ https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891864#action_12891864 ] Dmitriy V. Ryaboy commented on PIG-1205: When is the cut-off date for that? Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc -- Key: PIG-1205 URL: https://issues.apache.org/jira/browse/PIG-1205 Project: Pig Issue Type: Sub-task Affects Versions: 0.7.0 Reporter: Jeff Zhang Assignee: Jeff Zhang Fix For: 0.8.0 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1500) guava.jar should be removed from the lib folder
[ https://issues.apache.org/jira/browse/PIG-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890603#action_12890603 ] Dmitriy V. Ryaboy commented on PIG-1500: Have you tried actually building with this? The reason I put guava r3 into lib was that the public maven deploy for it is broken. Here's what happens when I apply this patch and try to build: {code} [ivy:resolve] WARNINGS [ivy:resolve] problem while downloading module descriptor: http://repo1.maven.org/maven2/com/google/guava/guava/r03/guava-r03.pom: invalid sha1: expected=1cbd6fab2460050ff7147b6d8536f39c8f535067 computed=7a37041386ee39a1fbb3efd3c4c6932809cb5887 (1304ms) {code} Now, we can probably still get away with removing guava from lib/ -- they just released guava-r6, which should be compatible with the guava-dependent code in Pig, and is supposed to have a proper maven deploy. But the patch as is should not be applied. guava.jar should be removed from the lib folder --- Key: PIG-1500 URL: https://issues.apache.org/jira/browse/PIG-1500 Project: Pig Issue Type: Bug Components: build Reporter: Giridharan Kesavan Assignee: niraj rai Fix For: 0.8.0 Attachments: removeGuavaJar.patch The guava jar is available in the maven repository, but it is still checked into the pig trunk's lib folder. I've checked the availability of the guava jar in the maven repository. http://mvnrepository.com/artifact/com.google.guava/guava -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API
[ https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890090#action_12890090 ] Dmitriy V. Ryaboy commented on PIG-1478: This seems to fit the bill. Add progress notification listener to PigRunner API --- Key: PIG-1478 URL: https://issues.apache.org/jira/browse/PIG-1478 Project: Pig Issue Type: Improvement Reporter: Richard Ding Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1478.patch PIG-1333 added PigRunner API to allow Pig users and tools to get a status/stats object back after executing a Pig script. The new API, however, is synchronous (blocking). It's known that a Pig script can spawn tens (even hundreds) of MR jobs and take hours to complete. Therefore it'll be nice to give progress feedback to the callers during the execution. The proposal is to add an optional parameter to the API: {code} public abstract class PigRunner { public static PigStats run(String[] args, PigProgressNotificationListener listener) {...} } {code} The new listener is defined as follows: {code} package org.apache.pig.tools.pigstats; public interface PigProgressNotificationListener extends java.util.EventListener { // just before the launch of MR jobs for the script public void LaunchStartedNotification(int numJobsToLaunch); // number of jobs submitted in a batch public void jobsSubmittedNotification(int numJobsSubmitted); // a job is started public void jobStartedNotification(String assignedJobId); // a job is completed successfully public void jobFinishedNotification(JobStats jobStats); // a job is failed public void jobFailedNotification(JobStats jobStats); // a user output is completed successfully public void outputCompletedNotification(OutputStats outputStats); // updates the progress as percentage public void progressUpdatedNotification(int progress); // the script execution is done public void launchCompletedNotification(int numJobsSucceeded); } {code} Any thoughts? 
[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation
[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888942#action_12888942 ] Dmitriy V. Ryaboy commented on PIG-1473: Thejas, do you think there could be any performance gains if we could delay deserialization of the top-level fields in the tuple, but deserialize whole maps or databags if they are touched? Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation - Key: PIG-1473 URL: https://issues.apache.org/jira/browse/PIG-1473 Project: Pig Issue Type: Improvement Affects Versions: 0.8.0 Reporter: Thejas M Nair Assignee: Thejas M Nair Fix For: 0.8.0 Cost of serialization/deserialization (sedes) can be very high and avoiding it will improve performance. Avoid sedes when possible by implementing approach #3 proposed in http://wiki.apache.org/pig/AvoidingSedes . The load function uses subclass of Map and DataBag which holds the serialized copy. LoadFunction delays deserialization of map and bag types until a member function of java.util.Map or DataBag is called. Example of query where this will help - {CODE} l = LOAD 'file1' AS (a : int, b : map [ ]); f = FOREACH l GENERATE udf1(a), b; fil = FILTER f BY $0 5; dump fil; -- Serialization of column b can be delayed until here using this approach . {CODE} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy closed PIG-1428. -- Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888545#action_12888545 ] Dmitriy V. Ryaboy commented on PIG-1428: I ran the new and changed tests manually before committing, but not the whole set (didn't have 12 hours to spare). Which tests are failing for you? Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888546#action_12888546 ] Dmitriy V. Ryaboy commented on PIG-1434: +1 for casting as tuple. Though it may have to look like {code} Y = foreach Z generate X::$1/(long) ((tuple)C).count, X::$2 - (long) ((tuple)C).max; {code} Definitely -1 on the bracket syntax.. it seems very non-intuitive. Allow casting relations to scalars -- Key: PIG-1434 URL: https://issues.apache.org/jira/browse/PIG-1434 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: scalarImpl.patch This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach. Example: A = load 'data' as (x, y, z); B = group A all; C = foreach B generate COUNT(A); . X = Y = foreach X generate $1/(long) C; Couple of additional comments: (1) You can only cast relations including a single value or an error will be reported (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence. (3) Y will look for C closest to it. Implementation thoughts: The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to (1) Store C (2) convert the cast to the UDF -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888588#action_12888588 ] Dmitriy V. Ryaboy commented on PIG-1428: Found the culprit, will commit fix within ~ 20 mins assuming tests pass. Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888599#action_12888599 ] Dmitriy V. Ryaboy commented on PIG-1428: Yeah, that's the patch I have, verbatim. Sorry about breaking the build again. Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: npe.patch, PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress etc. from UDFs. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Closed: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression
[ https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy closed PIG-1449. -- RegExLoader hangs on lines that don't match the regular expression -- Key: PIG-1449 URL: https://issues.apache.org/jira/browse/PIG-1449 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Justin Sanders Priority: Minor Fix For: 0.8.0 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, RegExLoader.patch The 0.7.0 changes to RegExLoader introduced a bug where the code stays in the while loop if the line isn't matched. Before 0.7.0 these lines would be skipped if they didn't match the regular expression. The result is that the mapper does not respond and times out with "Task attempt_X failed to report status for 600 seconds. Killing!". Here are the steps to recreate the bug: Create a text file in HDFS with the following lines: test1 testA test2 Run the following pig script: REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar; test = LOAD '/path/to/test.txt' using org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line); dump test; Expected result: (test1) (test2) Actual result: Job fails to complete after the 600 second timeout waiting on the mapper to complete. The mapper hangs at 33% since it can process the first line but gets stuck in the while loop on the second line. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
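The intended pre-0.7.0 behavior (skip lines that do not match, and keep reading) can be sketched as below. This is not the RegExLoader code or either attached patch: SkipNonMatchingSketch is a hypothetical name, the input is modelled as an Iterator of lines rather than a Hadoop record reader, and the use of Matcher.find() with capture group 1 is an assumption. The key point is that the loop consumes a line on every pass, so a non-matching line cannot be retried forever.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the piggybank RegExLoader. */
final class SkipNonMatchingSketch {

    /** Returns capture group 1 of every matching line; non-matching lines are skipped. */
    static List<String> matchedGroups(Iterator<String> lines, Pattern pattern) {
        List<String> out = new ArrayList<>();
        while (lines.hasNext()) {
            String line = lines.next(); // always advance before testing the pattern
            Matcher m = pattern.matcher(line);
            if (!m.find()) {
                continue; // skip this line; the next pass reads a new one
            }
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] argv) {
        List<String> input = Arrays.asList("test1", "testA", "test2");
        System.out.println(matchedGroups(input.iterator(), Pattern.compile("(test\\d)")));
    }
}
```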
[jira] Updated: (PIG-1469) DefaultDataBag assumes ArrayList as default List type
[ https://issues.apache.org/jira/browse/PIG-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1469: --- Status: Resolved (was: Patch Available) Resolution: Fixed I committed this. DefaultDataBag assumes ArrayList as default List type - Key: PIG-1469 URL: https://issues.apache.org/jira/browse/PIG-1469 Project: Pig Issue Type: Bug Components: data Affects Versions: 0.8.0 Reporter: Gianmarco De Francisci Morales Assignee: Gianmarco De Francisci Morales Fix For: 0.8.0 Attachments: PIG-1469.patch In org.apache.pig.data.DefaultDataBag, the field mContents is assumed to be of type ArrayList but the user can actually pass a different List to the constructor. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884763#action_12884763 ] Dmitriy V. Ryaboy commented on PIG-928: --- Aniket, the patch does not apply cleanly to trunk, can you rebase it? UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-928: -- Attachment: PIG-928.patch I rebased the patch and made it pull jython down via maven. 2.5.1 doesn't appear to be available right now, so this pulls down 2.5.0. Hope that's ok. Looks like the tabulation is wrong in most of this patch.. someone please hit ctrl-a, ctrl-i next time :). Needless to say, this thing needs tests, desperately. Also imho in order for it to make it into trunk, it should be a compile-time option to support (and pull down) jython or jruby or whatnot, not a default option. Otherwise we are well on our way to making people pull down the internet in order to compile pig. UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-928) UDFs in scripting languages
[ https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884845#action_12884845 ] Dmitriy V. Ryaboy commented on PIG-928: --- Aniket, I already made the changes you need to pull down jython -- take a look at the patch I attached. One more general note -- let's say jython instead of python (in the grammar, the keywords, everywhere), as there may be slight incompatibilities between the two and we want to be clear on what we are using. UDFs in scripting languages --- Key: PIG-928 URL: https://issues.apache.org/jira/browse/PIG-928 Project: Pig Issue Type: New Feature Reporter: Alan Gates Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: calltrace.png, package.zip, PIG-928.patch, pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, RegisterPythonUDFFinale.patch, RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip It should be possible to write UDFs in scripting languages such as python, ruby, etc. This frees users from needing to compile Java, generate a jar, etc. It also opens Pig to programmers who prefer scripting languages over Java. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884369#action_12884369 ] Dmitriy V. Ryaboy commented on PIG-1434: A couple of thoughts that came out of the Pig contributor meeting: 1) rather than scalar, we should make this work for single-tuple relations. That way a user can do something like this: {code} A = load 'data' as (x, y, z); B = group A all; C = foreach B generate COUNT(A) as count, MAX(A.y) as max; . X = Y = foreach X generate $1/(long) C.count, $2-(long) C.max; {code} 2) Writing the intermediate relation to a file can cause hotspots. We should push this into the distributed cache. In cases when the dist. cache is turned off, we can at least increase the replication factor to some large-ish number (10, maybe, like the jobs?) Allow casting relations to scalars -- Key: PIG-1434 URL: https://issues.apache.org/jira/browse/PIG-1434 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: scalarImpl.patch This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach. Example: A = load 'data' as (x, y, z); B = group A all; C = foreach B generate COUNT(A); . X = Y = foreach X generate $1/(long) C; Couple of additional comments: (1) You can only cast relations including a single value or an error will be reported (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence. (3) Y will look for C closest to it. Implementation thoughts: The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. 
Most of the work would be to update the logical plan to (1) Store C (2) convert the cast to the UDF -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1434) Allow casting relations to scalars
[ https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884502#action_12884502 ] Dmitriy V. Ryaboy commented on PIG-1434: SQL fails at runtime when executing queries that require a single row to be returned. So, oracle won't complain if you do this, for example: {code} SELECT foo.a, (SELECT c FROM bar WHERE foo.a = bar.a) from foo {code} unless the inner select produces more than one row. I think we should adopt the same approach -- assume the query is innocent until proven guilty. -D Allow casting relations to scalars -- Key: PIG-1434 URL: https://issues.apache.org/jira/browse/PIG-1434 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Aniket Mokashi Fix For: 0.8.0 Attachments: scalarImpl.patch This jira is to implement a simplified version of the functionality described in https://issues.apache.org/jira/browse/PIG-801. The proposal is to allow casting relations to scalar types in foreach. Example: A = load 'data' as (x, y, z); B = group A all; C = foreach B generate COUNT(A); . X = Y = foreach X generate $1/(long) C; Couple of additional comments: (1) You can only cast relations including a single value or an error will be reported (2) Name resolution is needed since relation X might have field named C in which case that field takes precedence. (3) Y will look for C closest to it. Implementation thoughts: The idea is to store C into a file and then convert it into scalar via a UDF. I believe we already have a UDF that Ben Reed contributed for this purpose. Most of the work would be to update the logical plan to (1) Store C (2) convert the cast to the UDF -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1427: --- Status: Patch Available (was: Open) Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff, PIG-1427.diff, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1427: --- Attachment: PIG-1427.diff Final version of the patch. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff, PIG-1427.diff, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1427: --- Status: Resolved (was: Patch Available) Fix Version/s: 0.8.0 Resolution: Fixed Committed. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff, PIG-1427.diff, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
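The idea in the PIG-1427 description, returning a default value instead of letting one runaway evaluation kill the job, can be sketched with the JDK's own concurrency utilities. This is not the committed patch: TimeLimitedFunction and all of its members are hypothetical names, and the in-memory timeout counter merely stands in for whatever reporting the real feature uses.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

/** Hypothetical wrapper for illustration only; not the PIG-1427 implementation. */
final class TimeLimitedFunction<I, O> {
    private final Function<I, O> delegate;
    private final long timeoutMillis;
    private final O defaultValue;
    private final AtomicLong timeouts = new AtomicLong();
    // Daemon threads so an abandoned evaluation cannot keep the JVM alive.
    private final ExecutorService executor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "time-limited-function");
        t.setDaemon(true);
        return t;
    });

    TimeLimitedFunction(Function<I, O> delegate, long timeoutMillis, O defaultValue) {
        this.delegate = delegate;
        this.timeoutMillis = timeoutMillis;
        this.defaultValue = defaultValue;
    }

    /** Evaluates the delegate, falling back to the default value on timeout. */
    O apply(I input) {
        Callable<O> task = () -> delegate.apply(input);
        Future<O> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);          // interrupts the worker thread
            timeouts.incrementAndGet();   // record the runaway evaluation
            return defaultValue;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return defaultValue;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    long timeoutCount() {
        return timeouts.get();
    }
}
```

One caveat of this approach: cancel(true) only interrupts the worker thread. Code that never checks for interruption, which includes a backtracking java.util.regex match, keeps running in the background even though the caller has already moved on with the default value.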
[jira] Commented: (PIG-1333) API interface to Pig
[ https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881513#action_12881513 ] Dmitriy V. Ryaboy commented on PIG-1333: +1 API interface to Pig Key: PIG-1333 URL: https://issues.apache.org/jira/browse/PIG-1333 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1333.patch, PIG-1333_1.patch, PIG-1333_2.patch, PIG-1333_3.patch It would be nice to make Pig more friendly for applications, such as workflow systems, that execute Pig scripts on a user's behalf. Currently, they have to use the pig command line to execute the code; however, this limits the kind of output that can be delivered. For instance, it is hard to produce error information that is easy to use programmatically, or to collect statistics. The proposal is to create a class that mimics the behavior of Main but gives users a status object back. The main code of Pig would then look something like: public static void main(String args[]) { PigStatus ps = PigMain.exec(args); System.exit(ps.rc); } We need to define the following: - Content of PigStatus. It should at least include * return code * error string * exception * statistics - A way to propagate the status class through pig code
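To make the proposal above concrete, here is a minimal sketch of a status object carrying the four items the description lists (return code, error string, exception, statistics). Everything in it is hypothetical: RunStatus and ScriptRunner are illustrative names rather than the classes in the attached patches, and the statistics map holds a made-up counter.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical status object; names are illustrative, not the PIG-1333 API. */
final class RunStatus {
    final int returnCode;
    final String errorMessage;            // null on success
    final Exception exception;            // null on success
    final Map<String, Long> statistics;   // read-only view

    private RunStatus(int returnCode, String errorMessage, Exception exception,
                      Map<String, Long> statistics) {
        this.returnCode = returnCode;
        this.errorMessage = errorMessage;
        this.exception = exception;
        this.statistics = Collections.unmodifiableMap(statistics);
    }

    static RunStatus success(Map<String, Long> statistics) {
        return new RunStatus(0, null, null, statistics);
    }

    static RunStatus failure(int returnCode, Exception e, Map<String, Long> statistics) {
        return new RunStatus(returnCode, e.getMessage(), e, statistics);
    }
}

/** Hypothetical entry point that reports failures through the status object. */
final class ScriptRunner {
    private ScriptRunner() {}

    static RunStatus exec(String[] args) {
        Map<String, Long> stats = new LinkedHashMap<>();
        try {
            if (args.length == 0) {
                throw new IllegalArgumentException("no script given");
            }
            stats.put("scriptsRun", 1L);  // placeholder statistic
            return RunStatus.success(stats);
        } catch (Exception e) {
            return RunStatus.failure(2, e, stats);
        }
    }
}
```

A command-line main can then reduce to calling ScriptRunner.exec(args) and exiting with the returned code, while an embedding application inspects the error and statistics fields directly.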
[jira] Updated: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Resolved (was: Patch Available) Resolution: Fixed Tags: pig-0.7.1 Committed to trunk. We may want to consider this for a 0.7.1, if such a thing comes about, as in a sense it's addressing a regression. I tagged this issue with pig-0.7.1 so we can find it later if we decide a dot-release is warranted. Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Make a StatusReporter singleton available for incrementing counters
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Summary: Make a StatusReporter singleton available for incrementing counters (was: Add getPigStatusReporter() to PigHadoopLogger) Patch Info: [Patch Available] Make a StatusReporter singleton available for incrementing counters --- Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Commented: (PIG-1333) API interface to Pig
[ https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878732#action_12878732 ] Dmitriy V. Ryaboy commented on PIG-1333: bq. I'm not sure we should make all Hadoop counters available through the new API. How useful will it be to the users? I'm open to suggestions. Can't speak for other users, but we use counters quite a bit with Elephant Bird and some internal code for keeping track of timed-out service requests, unparsable records, and more. The @MonitoredUDF annotation I proposed in PIG-1427 uses counters to report on runaway UDFs that get killed. I think the question isn't so much why would you expose them, as why wouldn't you expose them... API interface to Pig Key: PIG-1333 URL: https://issues.apache.org/jira/browse/PIG-1333 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1333.patch, PIG-1333_1.patch It would be nice to make Pig more friendly for applications, such as workflow systems, that execute Pig scripts on a user's behalf. Currently, they have to use the pig command line to execute the code; however, this limits the kind of output that can be delivered. For instance, it is hard to produce error information that is easy to use programmatically, or to collect statistics. The proposal is to create a class that mimics the behavior of Main but gives users a status object back. The main code of Pig would then look something like: public static void main(String args[]) { PigStatus ps = PigMain.exec(args); System.exit(ps.rc); } We need to define the following: - Content of PigStatus. It should at least include * return code * error string * exception * statistics - A way to propagate the status class through pig code
[jira] Commented: (PIG-1333) API interface to Pig
[ https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878826#action_12878826 ] Dmitriy V. Ryaboy commented on PIG-1333: Yup. API interface to Pig Key: PIG-1333 URL: https://issues.apache.org/jira/browse/PIG-1333 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1333.patch, PIG-1333_1.patch It would be nice to make Pig more friendly for applications, such as workflow systems, that execute Pig scripts on a user's behalf. Currently, they have to use the pig command line to execute the code; however, this limits the kind of output that can be delivered. For instance, it is hard to produce error information that is easy to use programmatically, or to collect statistics. The proposal is to create a class that mimics the behavior of Main but gives users a status object back. The main code of Pig would then look something like: public static void main(String args[]) { PigStatus ps = PigMain.exec(args); System.exit(ps.rc); } We need to define the following: - Content of PigStatus. It should at least include * return code * error string * exception * statistics - A way to propagate the status class through pig code
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Attachment: PIG-1428.patch Once more, with feeling. This implements Ashutosh's suggestion of making PigStatusReporter maintain a singleton and expose a public getInstance() method. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
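The singleton-with-getInstance() shape described in the comment above can be sketched as follows. This is not the code from PIG-1428.patch: ReporterSingleton and its in-memory counters are hypothetical stand-ins, whereas a real reporter would forward to the Hadoop task context. Eager static initialization is used here so that the getter needs no lock.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a status reporter; not Pig's actual class. */
final class ReporterSingleton {
    // Static final fields are safely published by class initialization,
    // so getInstance() does not need to be synchronized.
    private static final ReporterSingleton INSTANCE = new ReporterSingleton();

    // Counters handle their own concurrency.
    private final ConcurrentMap<String, AtomicLong> counters = new ConcurrentHashMap<>();

    private ReporterSingleton() {}

    static ReporterSingleton getInstance() {
        return INSTANCE;
    }

    /** Adds delta to the named counter and returns the new value. */
    long increment(String counter, long delta) {
        return counters.computeIfAbsent(counter, k -> new AtomicLong()).addAndGet(delta);
    }

    long value(String counter) {
        AtomicLong c = counters.get(counter);
        return c == null ? 0L : c.get();
    }
}
```

A UDF could then call ReporterSingleton.getInstance().increment("unparsableRecords", 1) from anywhere, without having a logger or context object passed in.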
[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1427: --- Attachment: PIG-1427.diff Slightly modified to match the patch in PIG-1428 Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
[jira] Commented: (PIG-1440) Refactor org.apache.pig.data.DataType to use Enums instead of integer constants
[ https://issues.apache.org/jira/browse/PIG-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876011#action_12876011 ] Dmitriy V. Ryaboy commented on PIG-1440: I think Enums are great for this, and have wished many a time that the types were Enums while working with Pig. I do want to point out, though, that this will affect a lot of user code -- any EvalFunc that specifies a schema, any LoadFunc that implements the metadata options, etc. Are we willing to break things for our users so soon after 0.7? Refactor org.apache.pig.data.DataType to use Enums instead of integer constants --- Key: PIG-1440 URL: https://issues.apache.org/jira/browse/PIG-1440 Project: Pig Issue Type: Improvement Reporter: Gianmarco De Francisci Morales Priority: Minor Refactoring DataType to use Enums instead of integer constants would provide many benefits, including: * Cleaner code * Easier to iterate over Enums * Easier to add new Enums without breaking backwards compatibility * Can use EnumMaps to easily link values to Enums * Better support for translation from Enums to Strings and vice versa The int (or, in Pig's case, byte) Enum pattern has several drawbacks, as summarized here: http://java.sun.com/j2se/1.5.0/docs/guide/language/enums.html Drawbacks: We have to explicitly convert Enum values to bytes when serializing. This can be done in DataReaderWriter. Possibly higher overhead than simply using bytes. Refactoring might be difficult. Thoughts?
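The serialization drawback named in the PIG-1440 description (Enum values must be converted to bytes explicitly) is usually handled by giving each constant its own code. The sketch below shows that pattern under stated assumptions: FieldType, its constants, and the byte values 1 to 3 are all hypothetical and do not reflect the constants in org.apache.pig.data.DataType.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical type enum; names and byte codes are illustrative only. */
enum FieldType {
    INTEGER((byte) 1),
    LONG((byte) 2),
    CHARARRAY((byte) 3);

    // Reverse lookup table, filled once after all constants exist.
    private static final Map<Byte, FieldType> BY_CODE = new HashMap<>();
    static {
        for (FieldType t : values()) {
            BY_CODE.put(t.code, t);
        }
    }

    private final byte code;

    FieldType(byte code) {
        this.code = code;
    }

    /** The byte written to the wire; independent of declaration order. */
    byte code() {
        return code;
    }

    /** Maps a serialized byte back to its constant. */
    static FieldType fromCode(byte code) {
        FieldType t = BY_CODE.get(code);
        if (t == null) {
            throw new IllegalArgumentException("Unknown type code: " + code);
        }
        return t;
    }
}
```

Because the code is stored explicitly, constants can be reordered or added without changing what is written to disk, and name() and valueOf() cover the String translation mentioned in the description.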
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Open (was: Patch Available) Trying to tickle Hudson. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Patch Available (was: Open) Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876093#action_12876093 ] Dmitriy V. Ryaboy commented on PIG-1427: Ashutosh, Alan, et al: review please. Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, PIG-1427.diff As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.
[jira] Commented: (PIG-1333) API interface to Pig
[ https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12876111#action_12876111 ] Dmitriy V. Ryaboy commented on PIG-1333: That's a heck of a patch. I am really looking forward to having this available. Not sure you need the map in the PIG_FEATURE enum. You can get an enum by offset using PIG_FEATURE.values(), so no need for the constructor; you can get a string representation using pigFeature.name() or pigFeature.toString(), so no need for getString(); and you can get the ordinal using pigFeature.ordinal(). Granted, the ordinals are 0-based, but you can just throw in a dummy value for the 0th spot to preserve all the offsets as they are. I see that you explicitly pull out the known and enumerated Pig counters. Any reason not to make all other job counters available as well via the same interface? API interface to Pig Key: PIG-1333 URL: https://issues.apache.org/jira/browse/PIG-1333 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Richard Ding Fix For: 0.8.0 Attachments: PIG-1333.patch It would be nice to make Pig more friendly for applications, such as workflow systems, that execute Pig scripts on a user's behalf. Currently, they have to use the pig command line to execute the code; however, this limits the kind of output that can be delivered. For instance, it is hard to produce error information that is easy to use programmatically, or to collect statistics. The proposal is to create a class that mimics the behavior of Main but gives users a status object back. The main code of Pig would then look something like: public static void main(String args[]) { PigStatus ps = PigMain.exec(args); System.exit(ps.rc); } We need to define the following: - Content of PigStatus. It should at least include * return code * error string * exception * statistics - A way to propagate the status class through pig code
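The review comment on the PIG_FEATURE enum relies only on methods every Java enum already has: values(), name(), and ordinal(). A minimal sketch of the suggestion follows; the enum Feature and its constant names are hypothetical and are not the values defined in the patch.

```java
/** Hypothetical feature enum; constant names are illustrative only. */
enum Feature {
    UNKNOWN,          // dummy 0th value so real features keep 1-based offsets
    MERGE_JOIN,
    REPLICATED_JOIN,
    ORDER_BY;

    /** Looks a constant up by offset, using the built-in values() array. */
    static Feature fromOffset(int offset) {
        Feature[] all = values();
        if (offset < 0 || offset >= all.length) {
            throw new IllegalArgumentException("No feature at offset " + offset);
        }
        return all[offset];
    }
}
```

With this, Feature.fromOffset(1) yields MERGE_JOIN, MERGE_JOIN.ordinal() is 1, and MERGE_JOIN.name() is the string "MERGE_JOIN", so no extra map, constructor, or getString() is required. The trade-off is that offsets now depend on declaration order, so inserting or reordering constants changes them.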
[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12874865#action_12874865 ] Dmitriy V. Ryaboy commented on PIG-1428: I notice that the issue has been discussed before in PIG-889, and Santosh argued (convincingly) that adding this method to PigLogger might not make sense. Santosh, would you like to suggest a different place to put this functionality? I am not married to using this method; it's just the path of least resistance. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12873908#action_12873908 ] Dmitriy V. Ryaboy commented on PIG-1428: Findbugs is quite right to call me out on the synchronization thing. I am not sure why the setter needs to be synchronized; I am even less sure the getter should be. Seems like this would add one more lock every time we want to increment a counter or write a log line, which is unfortunate (I assume those objects handle their own concurrency issues). Can Richard or Pradeep comment on that? Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Open (was: Patch Available) Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Patch Available (was: Open) Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Attachment: PIG-1428.patch Removed the synchronized keyword. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch, PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Assigned: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy reassigned PIG-1428: -- Assignee: Dmitriy V. Ryaboy Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Assignee: Dmitriy V. Ryaboy Fix For: 0.8.0 Attachments: PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Attachment: PIG-1428.patch No tests, as this is trivial. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 Attachments: PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger
[ https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1428: --- Status: Patch Available (was: Open) Please review if this gets no -1s other than lack of tests. Add getPigStatusReporter() to PigHadoopLogger - Key: PIG-1428 URL: https://issues.apache.org/jira/browse/PIG-1428 Project: Pig Issue Type: Bug Affects Versions: 0.7.0 Reporter: Ashutosh Chauhan Fix For: 0.8.0 Attachments: PIG-1428.patch Without this getter method, it's not possible to get counters, report progress, etc. from UDFs.
[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs
[ https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitriy V. Ryaboy updated PIG-1427: --- Attachment: guava-r03.jar Attaching the guava jar that needs to be placed in lib/ in order to test this. It is theoretically available via Maven, but at the moment the deploy to Maven is misconfigured and the artifact cannot be fetched (see above reference). The guava library is licensed under Apache 2.0: http://code.google.com/p/guava-libraries/ Monitor and kill runaway UDFs - Key: PIG-1427 URL: https://issues.apache.org/jira/browse/PIG-1427 Project: Pig Issue Type: New Feature Affects Versions: 0.8.0 Reporter: Dmitriy V. Ryaboy Assignee: Dmitriy V. Ryaboy Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch As a safety measure, it is sometimes useful to monitor UDFs as they execute. It is often preferable to return null or some other default value instead of timing out a runaway evaluation and killing a job. We have in the past seen complex regular expressions lead to job failures due to just half a dozen (out of millions) particularly obnoxious strings. It would be great to give Pig users a lightweight way of enabling UDF monitoring.