[jira] Resolved: (PIG-916) Change the pig hbase interface to get more than one row at a time when scanning

2010-09-21 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy resolved PIG-916.
---

Fix Version/s: 0.8.0
   Resolution: Duplicate

Fixed in PIG-1205

 Change the pig hbase interface to get more than one row at a time when 
 scanning
 ---

 Key: PIG-916
 URL: https://issues.apache.org/jira/browse/PIG-916
 Project: Pig
  Issue Type: Improvement
Reporter: Alex Newman
Assignee: Dmitriy V. Ryaboy
Priority: Trivial
 Fix For: 0.8.0


 It should be significantly faster to get numerous rows at the same time 
 rather than one row at a time for large table extraction processes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1611) use enums for error code

2010-09-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909493#action_12909493
 ] 

Dmitriy V. Ryaboy commented on PIG-1611:


+140

 use enums for error code
 

 Key: PIG-1611
 URL: https://issues.apache.org/jira/browse/PIG-1611
 Project: Pig
  Issue Type: Sub-task
Reporter: Thejas M Nair
 Fix For: 0.9.0


 Pig code uses integer constants for error codes, and each error code value is 
 reserved using 
 http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification .
 This process is cumbersome and error-prone.
 It would be better to use enum values instead. The enum value can contain the 
 error message and encapsulate the error code. 
 For example -
 {code}
 // Replace
 throw new SchemaMergeException("Error in merging schema", 2124, 
 PigException.BUG); 
 // with
 throw new SchemaMergeException(SCHEMA_MERGE_EX, PigException.BUG); 
 {code}
 Where SCHEMA_MERGE_EX belongs to an error-code enum. We can use the ordinal 
 value of the enum and an offset to determine the error code. 
 The error message will be passed through the constructor of the enum.
 {code}
 SCHEMA_MERGE_EX("Error in merging schema");
 {code}
 For documentation, the error code and error messages can be dumped using code 
 that uses the enum error code class.
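
 A minimal sketch of the enum described above, not Pig's actual code. The enum 
 name PigErrorCode, the second constant, and the OFFSET value are all 
 assumptions made for illustration. It shows the code being derived from the 
 ordinal plus an offset, the message being passed to the constructor, and a 
 dump of all codes for documentation.

```java
// Hypothetical sketch: names and the offset are illustrative, not Pig's API.
enum PigErrorCode {
    // Declaration order fixes the ordinal, so new entries must be appended
    // at the end to keep already published codes stable.
    SCHEMA_MERGE_EX("Error in merging schema"),
    TYPE_MISMATCH_EX("Type mismatch in expression"); // made-up second entry

    // Assumed base value; the real numbering scheme is not given in the issue.
    private static final int OFFSET = 2000;

    private final String message;

    PigErrorCode(String message) {
        this.message = message;
    }

    // Error code = offset + position of the constant in the enum.
    int code() {
        return OFFSET + ordinal();
    }

    String message() {
        return message;
    }

    // Dump every code, name, and message, e.g. to generate documentation.
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (PigErrorCode e : values()) {
            sb.append(e.code()).append('\t')
              .append(e.name()).append('\t')
              .append(e.message()).append('\n');
        }
        return sb.toString();
    }
}
```

 One consequence of the ordinal approach is that reordering or deleting 
 constants silently renumbers every later code, so the enum would have to be 
 treated as append-only.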




[jira] Commented: (PIG-1602) The .classpath of eclipse template still use hbase-0.20.0

2010-09-06 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906451#action_12906451
 ] 

Dmitriy V. Ryaboy commented on PIG-1602:


+1

 The .classpath of eclipse template still use hbase-0.20.0
 -

 Key: PIG-1602
 URL: https://issues.apache.org/jira/browse/PIG-1602
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1602.patch


 The .classpath of the eclipse template still uses hbase-0.20.0; it should be 
 updated to hbase-0.20.6




[jira] Updated: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-04 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1597:
---

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.9.0
   Resolution: Fixed

Committed.

 Development snapshot jar no longer picked up by bin/pig
 ---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0, 0.9.0

 Attachments: PIG_1597.patch


 As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
 development pig jars. This appears to have been introduced in PIG-1334, as 
 the jar was renamed from -dev- to -SNAPSHOT-




[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-04 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906305#action_12906305
 ] 

Dmitriy V. Ryaboy commented on PIG-1596:


+1

Before committing, do you mind combining the new test with one of the existing 
ones? Trying to keep the test suite at under 24 hours :)

 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch, 
 PIG_1596_2.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig-*-SNAPSHOT.jar and not the 
 build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905818#action_12905818
 ] 

Dmitriy V. Ryaboy commented on PIG-1596:


I will review on Friday.

 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch, 
 PIG_1596_2.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig-*-SNAPSHOT.jar and not the 
 build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Commented: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-03 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905819#action_12905819
 ] 

Dmitriy V. Ryaboy commented on PIG-1597:


Does anyone object to me just committing this to 0.8 and trunk?

 Development snapshot jar no longer picked up by bin/pig
 ---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1597.patch


 As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
 development pig jars. This appears to have been introduced in PIG-1334, as 
 the jar was renamed from -dev- to -SNAPSHOT-




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905612#action_12905612
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Doug and Scott will know better of course, but afaik, Avro doesn't support 
Object keys.

You can cheat and turn Object keys into strings by Base64-encoding their 
serialized representations.. you'd have to know to reverse the process when 
deserializing, though.

Or we can try to get rid of InternalMap.
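
The Base64 workaround mentioned above could look roughly like the sketch 
below. It is only an illustration under stated assumptions: the class name 
KeyCodec is made up, plain Java serialization stands in for whatever 
serialization the map keys actually use, and java.util.Base64 requires Java 8 
or later, which postdates this thread. The point is just that the reader has to 
apply the exact reverse of what the writer did.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Base64;

// Hypothetical helper: turns an arbitrary serializable key into a String
// (so it can be used where only string keys are allowed) and back again.
final class KeyCodec {
    private KeyCodec() {}

    // Serialize the key to bytes, then Base64-encode the bytes as text.
    static String encode(Serializable key) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(key);
        } catch (IOException e) {
            throw new IllegalStateException("could not serialize key", e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Reverse the process: Base64-decode, then deserialize the object.
    static Object decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("could not deserialize key", e);
        }
    }
}
```

Note that the encoded strings no longer sort or compare like the original 
keys, so this only helps where the keys are used for lookup.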

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-09-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905775#action_12905775
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, that's what I am saying -- since they are writables, we can turn them 
into strings and not need InternalMap at all.

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroStorage_4.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Created: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)
ORDER BY distribution is uneven when record size is correlated with order key
-

 Key: PIG-1592
 URL: https://issues.apache.org/jira/browse/PIG-1592
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.9.0


The partitioner contributed in PIG-545 distributes the order key space between 
partitions so that each partition gets approximately the same number of keys, 
even when the keys have a non-uniform distribution over the key space.

Unfortunately this still allows for severe partition imbalance when record size 
is correlated with the order key. By way of motivating example, consider this 
script which attempts to produce a list of genuses based on how many species 
each genus contains:

{code}
set default_parallel 60;
critters = load 'biodata' as (genus, species);
genus_counts = foreach (group critters by genus) generate group as genus, 
COUNT(critters) as num_species, critters;
ordered_genuses = order genus_counts by num_species desc;
store ordered_genuses
{code}

The higher the value of num_species, the more species tuples the critters bag 
contains and the wider the row. This can cause a severe processing imbalance: 
the partition holding the records with the highest values of num_species will 
have the same number of *records* as the partition holding the lowest values, 
but it will have far more actual *bytes* to work on.




[jira] Commented: (PIG-1592) ORDER BY distribution is uneven when record size is correlated with order key

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905220#action_12905220
 ] 

Dmitriy V. Ryaboy commented on PIG-1592:


One proposal is to simply change the default weighted range partitioner to take 
into account the record size. If record size is uniform, or uniformly 
distributed, or non-uniformly distributed but independent of the order key, 
this change shouldn't materially affect the distributions created for data sets 
not covered by this issue.
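
To make the proposal concrete, here is a rough sketch of choosing split points 
by cumulative bytes instead of by record count. It is not the code of Pig's 
weighted range partitioner: the class SizeAwareQuantiles, its method, and the 
use of long keys are assumptions for illustration, and the input is assumed to 
be a sample of records already sorted by key with a size in bytes for each.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: pick range-partition boundaries so that each partition
// receives roughly the same number of bytes rather than the same number of keys.
final class SizeAwareQuantiles {
    private SizeAwareQuantiles() {}

    /**
     * @param sortedKeys    sampled order keys, ascending
     * @param sizes         size in bytes of the record behind each sampled key
     * @param numPartitions number of output partitions
     * @return up to numPartitions - 1 split keys; each split key is the
     *         inclusive upper bound of its partition (assumed convention)
     */
    static List<Long> splitPoints(long[] sortedKeys, long[] sizes, int numPartitions) {
        long total = 0;
        for (long s : sizes) {
            total += s;
        }
        List<Long> splits = new ArrayList<>();
        long running = 0;
        for (int i = 0; i < sortedKeys.length && splits.size() < numPartitions - 1; i++) {
            running += sizes[i];
            // Byte count at which the next partition boundary should fall.
            long target = total * (splits.size() + 1) / numPartitions;
            if (running >= target) {
                splits.add(sortedKeys[i]);
            }
        }
        return splits;
    }
}
```

With keys {1, 2, 3, 4} and sizes {6, 2, 2, 2}, two partitions split after key 1 
(6 bytes on each side), whereas a count-based split after key 2 would put 8 
bytes on one side and 4 on the other. With uniform sizes the byte-based split 
lands after key 2, the same place as the count-based one, which is the 
"shouldn't materially affect" case above.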

 ORDER BY distribution is uneven when record size is correlated with order key
 -

 Key: PIG-1592
 URL: https://issues.apache.org/jira/browse/PIG-1592
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.9.0


 The partitioner contributed in PIG-545 distributes the order key space 
 between partitions so that each partition gets approximately the same number 
 of keys, even when the keys have a non-uniform distribution over the key 
 space.
 Unfortunately this still allows for severe partition imbalance when record 
 size is correlated with the order key. By way of motivating example, consider 
 this script which attempts to produce a list of genuses based on how many 
 species each genus contains:
 {code}
 set default_parallel 60;
 critters = load 'biodata' as (genus, species);
 genus_counts = foreach (group critters by genus) generate group as genus, 
 COUNT(critters) as num_species, critters;
 ordered_genuses = order genus_counts by num_species desc;
 store ordered_genuses
 {code}
 The higher the value of num_species, the more species tuples the critters 
 bag contains and the wider the row. This can cause a severe processing 
 imbalance: the partition holding the records with the highest values of 
 num_species will have the same number of *records* as the partition holding 
 the lowest values, but it will have far more actual *bytes* to work on.




[jira] Commented: (PIG-1596) NPE's thrown when attempting to load hbase columns containing null values

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905404#action_12905404
 ] 

Dmitriy V. Ryaboy commented on PIG-1596:


Jeff,
I think it's clearer if you insert null into the tuple, not an empty 
DataByteArray (and assertNull in the test)

George, the SNAPSHOT thing is a real bug; thanks for catching that. It was 
introduced when pig was made available through maven in PIG-1334.

I'll create a separate ticket for that.

 NPE's thrown when attempting to load hbase columns containing null values
 -

 Key: PIG-1596
 URL: https://issues.apache.org/jira/browse/PIG-1596
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.7.0
Reporter: George P. Stathis
 Fix For: 0.8.0, 0.9.0

 Attachments: null_hbase_records.patch, PIG_1596.patch


 I'm not a committer, but I'd like to suggest the attached patch to handle 
 loading hbase rows containing null cell values (since hbase is all about 
 sparsely populated data rows). As it stands, a DataByteArray can be created 
 with a null mData if a cell has no value, which causes NPEs when simply 
 attempting to load a row containing the null cell in question.
 PS: the attached patch also contains a slight change to the bin/pig 
 executable to point to the build/pig-*-SNAPSHOT.jar and not the 
 build/pig-*-dev.jar (the latter no longer seems to exist). If you prefer a 
 separate patch for this, I'll be happy to submit it.




[jira] Created: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)
Development snapshot jar no longer picked up by bin/pig
---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0


As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
development pig jars. This appears to have been introduced in PIG-1334, as the 
jar was renamed from -dev- to -SNAPSHOT-




[jira] Updated: (PIG-1597) Development snapshot jar no longer picked up by bin/pig

2010-09-01 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1597:
---

Status: Patch Available  (was: Open)

 Development snapshot jar no longer picked up by bin/pig
 ---

 Key: PIG-1597
 URL: https://issues.apache.org/jira/browse/PIG-1597
 Project: Pig
  Issue Type: Bug
  Components: grunt
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1597.patch


 As George Stathis pointed out in PIG-1596, bin/pig no longer picks up 
 development pig jars. This appears to have been introduced in PIG-1334, as 
 the jar was renamed from -dev- to -SNAPSHOT-




[jira] Commented: (PIG-794) Use Avro serialization in Pig

2010-08-31 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904615#action_12904615
 ] 

Dmitriy V. Ryaboy commented on PIG-794:
---

Jeff, have you checked out Scott Carey's work here: 
https://issues.apache.org/jira/browse/AVRO-592 ?

 Use Avro serialization in Pig
 -

 Key: PIG-794
 URL: https://issues.apache.org/jira/browse/PIG-794
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.2.0
Reporter: Rakesh Setty
Assignee: Dmitriy V. Ryaboy
 Attachments: avro-0.1-dev-java_r765402.jar, AvroStorage.patch, 
 AvroStorage_2.patch, AvroStorage_3.patch, AvroTest.java, 
 jackson-asl-0.9.4.jar, PIG-794.patch


 We would like to use Avro serialization in Pig to pass data between MR jobs 
 instead of the current BinStorage. Attached is an implementation of 
 AvroBinStorage which performs significantly better compared to BinStorage on 
 our benchmarks.




[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-30 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Attachment: PIG_1205_9.patch

Patch with the StoreCaster changes as suggested by Alan. With +1s from Alan and 
Jeff, committing.

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, 
 PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, 
 PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch







[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-30 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904325#action_12904325
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


Re HBASE-1933, they are publishing snapshots of current trunk, not the 0.20 
branch. We'll be able to start using maven to pull down hbase when we upgrade 
to their 0.9 release (which iirc depends on hdfs appends...)

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, 
 PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, 
 PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch







[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-30 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

  Status: Resolved  (was: Patch Available)
Release Note: 
HBaseStorage has been significantly reworked with this release.

Usage:
{code}
my_data = LOAD 'hbase://table_name' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 
colfamily:col2', '-caching 100') as (col1:int, col2:chararray);

STORE my_data INTO 'hbase://other_table' USING 
org.apache.pig.backend.hadoop.hbase.HBaseStorage('colfamily:col1 
colfamily:col2');
{code}

HBaseStorage can now write data into HBase as well as read it. The first 
argument is a space-delimited list of columns to be loaded (or stored). Columns 
are specified as columnfamily:column_name. The second argument is an optional 
set of key-value pairs used to control HBaseStorage behavior. Available 
arguments are:

* {{-loadKey}} Used to load the row key; false by default. If true, the first 
field in the returned tuple will be the value of the row key.
* {{-gt}}, {{-gte}}, {{-lt}}, and {{-lte}} Used to specify bounds on the row 
keys to be scanned. The keys are specified as binary data, using the hex 
representation. Any slashes have to be double-escaped (two slashes per single 
real slash) to be parsed correctly.
* {{-caching}} Used to specify the number of rows to be cached per HBase RPC 
call. See 
http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#setScannerCaching%28int%29
 for more information about this HBase feature.
* {{-limit}} Used to control how many rows *per scanned region* will be 
retrieved. This can speed up processing if you just want a few rows. The total 
number of rows returned will be at most (number of regions * limit). The limit 
is applied after any -gt, -lt, etc. filters. Pig's LIMIT operator can be used 
in conjunction with this argument.
* {{-caster}} Used to specify a LoadCaster (or LoadStoreCaster, for storage) 
used to convert the data stored in HBase into Pig data. By default, the 
Utf8StorageConverter is used, which stores all data as its string 
representation. The string HBaseBinaryConverter can be used to specify that 
data is stored in HBase's native binary format. Note that the 
HBaseBinaryConverter does not work with complex data types such as maps, 
tuples, and bags. You can also specify a fully qualified class name such as 
org.apache.pig.backend.hadoop.hbase.HBaseBinaryConverter to use your own 
caster. The default caster can be changed by setting the pig.hbase.caster 
property in pig.properties.

HBaseStorage matches column arguments to tuple fields based on their ordinal 
position. When storing, the first field is expected to be the key value.
  Resolution: Fixed

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, 
 PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, 
 PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch, PIG_1205_9.patch







[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-29 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Attachment: hbase-0.20.6.jar
hbase-0.20.6-test.jar

Attaching the hbase-0.20.6 jars

HBase is an apache project, so no license issues.

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: hbase-0.20.6-test.jar, hbase-0.20.6.jar, PIG_1205.patch, 
 PIG_1205_2.patch, PIG_1205_3.patch, PIG_1205_4.patch, PIG_1205_5.path, 
 PIG_1205_6.patch, PIG_1205_7.patch, PIG_1205_8.patch







[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903634#action_12903634
 ] 

Dmitriy V. Ryaboy commented on PIG-1150:


I won't have time before the 30th. 

BTW one doesn't even need a udf if using the sum of squares approach.. :-) just 
generate the square and the sum in the foreach (it will perform the algebraic 
decomposition automatically)
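
For reference, the sum of squares approach boils down to the arithmetic in the 
sketch below. This is not the attached UDF: the class VarianceParts and its 
methods are hypothetical names. It shows why the computation decomposes: each 
chunk of data contributes only a count, a sum, and a sum of squares, partial 
results merge by addition, and the (population) variance falls out at the end 
as E[x^2] - (E[x])^2.

```java
// Hypothetical sketch of the count / sum / sum-of-squares decomposition.
final class VarianceParts {
    final long count;
    final double sum;
    final double sumSq;

    VarianceParts(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // "Initial" step: summarize one chunk of raw values.
    static VarianceParts of(double... values) {
        double s = 0;
        double sq = 0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new VarianceParts(values.length, s, sq);
    }

    // "Intermediate" step: partial summaries combine by plain addition,
    // which is what makes the computation algebraic.
    VarianceParts merge(VarianceParts other) {
        return new VarianceParts(
            count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    // "Final" step: population variance = mean of squares - square of mean.
    double variance() {
        double mean = sum / count;
        return sumSq / count - mean * mean;
    }
}
```

This formula can lose precision when the mean is large relative to the spread 
of the values, which is one reason the Wikipedia page linked in the issue also 
lists more numerically stable variants.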

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903636#action_12903636
 ] 

Dmitriy V. Ryaboy commented on PIG-1563:


Sounds good.  Should we just merge in the amazon contrib for some of these?

 SUBSTRING function is broken
 

 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1563.patch


 Script:
 A = load 'studenttab10k' as (name, age, gpa);
 C = foreach A generate SUBSTRING(name, 0,5);
 E = limit C 10;
 dump E;
 Output is always empty:
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()




[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903643#action_12903643
 ] 

Dmitriy V. Ryaboy commented on PIG-1150:


Yeah I think it's not a big deal if we are splitting piggybank out soon anyway.

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903644#action_12903644
 ] 

Dmitriy V. Ryaboy commented on PIG-1563:


Olga, the amazon contrib is PIG-1565

 SUBSTRING function is broken
 

 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1563.patch


 Script:
 A = load 'studenttab10k' as (name, age, gpa);
 C = foreach A generate SUBSTRING(name, 0,5);
 E = limit C 10;
 dump E;
 Output is always empty:
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()




[jira] Commented: (PIG-1564) add support for multiple filesystems

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903728#action_12903728
 ] 

Dmitriy V. Ryaboy commented on PIG-1564:


Andrew, does 'fs -cd s3://anhi-test-data/' work?

The cd command is also deprecated (though not marked as such) :)

 add support for multiple filesystems
 

 Key: PIG-1564
 URL: https://issues.apache.org/jira/browse/PIG-1564
 Project: Pig
  Issue Type: Improvement
Reporter: Andrew Hitchcock
 Attachments: PIG-1564-1.patch


 Currently you can't run Pig scripts that read data from one file system and 
 write it to another. Also, Grunt doesn't support CDing from one directory to 
 another on different file systems.




[jira] Commented: (PIG-1563) SUBSTRING function is broken

2010-08-27 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903753#action_12903753
 ] 

Dmitriy V. Ryaboy commented on PIG-1563:


+1

question/comment -- any reason you discarded the new buildSimpleFuncSpec I 
wrote in the first iteration of this patch? I think it simplifies the code:

{code}
funcList.add(Utils.buildSimpleFuncSpec(
  this.getClass().getName(), DataType.CHARARRAY, DataType.CHARARRAY));
{code}

vs
{code}
Schema s = new Schema();
s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
s.add(new Schema.FieldSchema(null, DataType.CHARARRAY));
funcList.add(new FuncSpec(this.getClass().getName(), s));
{code}

 SUBSTRING function is broken
 

 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1563.patch, PIG_1563_v2.patch


 Script:
 A = load 'studenttab10k' as (name, age, gpa);
 C = foreach A generate SUBSTRING(name, 0,5);
 E = limit C 10;
 dump E;
 Output is always empty:
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()




[jira] Updated: (PIG-1555) [piggybank] add CSV Loader

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1555:
---

  Status: Resolved  (was: Patch Available)
Release Note: 
CSVLoader can be used to load comma-separated value files.
It properly handles commas included inside quoted fields, and quotes escaped by 
preceding them with another quote character (Excel-style).
CSVLoader only handles single-line entries; quoting a multi-line value will 
*not* work.
  Resolution: Fixed
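
As a rough illustration of the quoting rules described in the release note (commas inside quoted fields, and quotes escaped by doubling them), a single line could be split along these lines. This is a minimal sketch with hypothetical names; it is not the piggybank CSVLoader code, and like the loader it assumes each record fits on one line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the piggybank CSVLoader implementation.
final class CsvLineSketch {
    /** Splits one line on commas, honoring double-quoted fields and "" escapes. */
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes a literal quote
                        i++;
                    } else {
                        inQuotes = false; // closing quote
                    }
                } else {
                    current.append(c); // commas inside quotes are kept as data
                }
            } else if (c == '"') {
                inQuotes = true; // opening quote
            } else if (c == ',') {
                fields.add(current.toString()); // field boundary
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString()); // last field
        return fields;
    }
}
```

For example, the line a,"b,c","say ""hi""" splits into three fields: a, then b,c, then say "hi".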

 [piggybank] add CSV Loader
 --

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1555.patch


 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
 done.




[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903031#action_12903031
 ] 

Dmitriy V. Ryaboy commented on PIG-1518:


This is a great feature, thanks Yan.

Could you comment on what the final solution was as far as PigStorage and 
OrderedLoadFunc? I see two ideas (yours and Ashutosh's) in the discussion, but 
not which direction you ultimately took.

 multi file input format for loaders
 ---

 Key: PIG-1518
 URL: https://issues.apache.org/jira/browse/PIG-1518
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
 PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch


 We frequently run into the situation where Pig needs to deal with small files 
 in the input. In this case a separate map is created for each file, which 
 can be very inefficient. 
 It would be great to have an umbrella input format that can take multiple 
 files and use them in a single split. We would like to see this working with 
 different data formats if possible.
 There are already a couple of input formats doing a similar thing: 
 MultifileInputFormat as well as CombinedInputFormat; however, neither works 
 with the new Hadoop 20 API. 
 We at least want to do a feasibility study for Pig 0.8.0.




[jira] Updated: (PIG-1563) SUBSTRING function is broken

2010-08-25 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1563:
---

   Status: Patch Available  (was: Open)
Affects Version/s: 0.8.0

 SUBSTRING function is broken
 

 Key: PIG-1563
 URL: https://issues.apache.org/jira/browse/PIG-1563
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Olga Natkovich
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1563.patch


 Script:
 A = load 'studenttab10k' as (name, age, gpa);
 C = foreach A generate SUBSTRING(name, 0,5);
 E = limit C 10;
 dump E;
 Output is always empty:
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()




[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Status: Open  (was: Patch Available)

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Attachment: PIG_1551.2.patch

Attaching patch that fixes the two errors Richard pointed out.


 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Status: Patch Available  (was: Open)

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901787#action_12901787
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


Jeff,
Thanks a lot for pitching in with the tests!

I was using 0.20.0 and the old tests passed. I've only tested the binary 
conversion stuff and other new features on the Twitter machines, and they do 
run a later HBase version -- perhaps the incompatibility is in the filters or 
binary casters code?
Do you know which tests fail with 0.20.0?

I will definitely add a bunch of documentation.

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, 
 PIG_1205_8.patch







[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902008#action_12902008
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


OK, let's upgrade to HBase 0.20.6 then. We could work around this by serializing the 
filters ourselves and applying them to the scan when reading the UDFContext, but that 
seems a bit overboard, and folks should be upgrading anyway. 

*Committers*: this is ready for review.



 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch, 
 PIG_1205_8.patch







[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Attachment: PIG_1551.3.patch

Ugh. Thank you for catching that -- fixed, and added a test to make sure it 
stays fixed.

The particular set of methods I needed this for used primitives, so that's what 
I did. It's a bit tricky to add support for Long, Double, etc. arrays, as I 
would have to check all combinations of possible method signatures when seeing 
things like (int[], int[], int[]), and it becomes fairly ugly code. Do you think 
this is particularly compelling? I can't really think of methods that take 
arrays of Number classes; usually, if you start using Numbers, you are also 
using Collections, not plain arrays.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

  Status: Resolved  (was: Patch Available)
Release Note: 
The idea is simple: frequently, Pig users need to use a simple function that is 
already provided by standard Java libraries, but for which a UDF has not been 
written. Dynamic Invokers allow a Pig programmer to refer to Java functions 
without having to wrap them in custom Pig UDFs, at the cost of doing some Java 
reflection on every function call.

{code}
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');
encoded_strings = LOAD 'encoded_strings.txt' as (encoded:chararray);
decoded_strings = FOREACH encoded_strings GENERATE UrlDecode(encoded, 'UTF-8');
{code}

Currently, Dynamic Invokers can be used for any static function that accepts no 
arguments or some combination of Strings, ints, longs, doubles, floats, or 
arrays of same, and returns a String, an int, a long, a double, or a float. 
Primitives only for the numbers, no capital-letter numeric classes as 
arguments. Depending on the return type, a specific kind of Invoker must be 
used: InvokeForString, InvokeForInt, InvokeForLong, InvokeForDouble, or 
InvokeForFloat.

The DEFINE keyword is used to bind a keyword to a Java method, as above. The 
first argument to the InvokeFor* constructor is the full path to the desired 
method. The second argument is a space-delimited ordered list of the classes of 
the method arguments. It can be omitted, or left as an empty string, if the method 
takes no arguments. Valid class names are String, Long, Float, Double, and Int. 
Invokers can also work with array arguments, represented in Pig as DataBags of 
single-tuple elements. Simply refer to string[], for example. Class names are 
not case-sensitive.

The ability to use invokers on methods that take array arguments makes methods 
like those in org.apache.commons.math.stat.StatUtils available for processing 
the results of grouping your datasets, for example. This is very nice, but a 
word of caution: the resulting UDF will of course not be optimized for Hadoop, 
and the very significant benefits one gains from implementing the Algebraic and 
Accumulative interfaces are lost here. Be careful with this one.
  Resolution: Fixed
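
To make the mechanism in the release note a little more concrete, here is a minimal sketch of the underlying reflection idea: resolve a static method from its full path plus a space-delimited, case-insensitive type list, then call it reflectively on every invocation. The class below is hypothetical and is not Pig's InvokeFor* implementation; it handles only scalar String/int/long/float/double parameters and leaves out array (bag) arguments and Pig tuple handling.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Locale;

// Hypothetical sketch of the reflection idea; this is not Pig's InvokeFor* code.
final class InvokerSketch {
    private final Method method;

    /**
     * @param fullName  class name plus static method name, e.g. "java.net.URLDecoder.decode"
     * @param paramSpec space-delimited parameter types, e.g. "String String"; may be empty
     */
    InvokerSketch(String fullName, String paramSpec) {
        this.method = resolve(fullName, paramSpec);
    }

    private static Method resolve(String fullName, String paramSpec) {
        int dot = fullName.lastIndexOf('.');
        if (dot < 0) {
            throw new IllegalArgumentException("Expected package.Class.method: " + fullName);
        }
        String trimmed = paramSpec == null ? "" : paramSpec.trim();
        String[] names = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        Class<?>[] types = new Class<?>[names.length];
        for (int i = 0; i < names.length; i++) {
            types[i] = toClass(names[i]);
        }
        try {
            Method m = Class.forName(fullName.substring(0, dot))
                    .getMethod(fullName.substring(dot + 1), types);
            if (!Modifier.isStatic(m.getModifiers())) {
                throw new IllegalArgumentException(fullName + " is not static");
            }
            return m;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot resolve " + fullName, e);
        }
    }

    /** Maps a case-insensitive type name to a class; numbers map to primitives only. */
    private static Class<?> toClass(String name) {
        switch (name.toLowerCase(Locale.ROOT)) {
            case "string": return String.class;
            case "int":    return int.class;
            case "long":   return long.class;
            case "float":  return float.class;
            case "double": return double.class;
            default: throw new IllegalArgumentException("Unsupported type: " + name);
        }
    }

    /** Reflective call; this per-call overhead is the cost the release note mentions. */
    Object invoke(Object... args) {
        try {
            return method.invoke(null, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Invocation failed", e);
        }
    }
}
```

With this sketch, new InvokerSketch("java.net.URLDecoder.decode", "String String").invoke(encoded, "UTF-8") plays the role of the UrlDecode definition shown in the release note.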

Committed.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch, PIG_1551.2.patch, PIG_1551.3.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 




[jira] Updated: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-08-24 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1354:
---

Release Note: Please see PIG-1551 release notes.

 UDFs for dynamic invocation of simple Java methods
 --

 Key: PIG-1354
 URL: https://issues.apache.org/jira/browse/PIG-1354
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch


 The need to create wrapper UDFs for simple Java functions creates unnecessary 
 work for Pig users, slows down the development process, and produces a lot of 
 trivial classes. We can use Java's reflection to allow invoking a number of 
 methods on the fly, dynamically, by creating a generic UDF to accomplish this.




[jira] Commented: (PIG-1354) UDFs for dynamic invocation of simple Java methods

2010-08-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901584#action_12901584
 ] 

Dmitriy V. Ryaboy commented on PIG-1354:


Olga,
There is a follow-up ticket here: https://issues.apache.org/jira/browse/PIG-1551
If that gets committed, I have a pretty detailed explanation of how to use the 
stuff in 
http://squarecog.wordpress.com/2010/08/20/upcoming-features-in-pig-0-8-dynamic-invokers/
 (happy to put the link in release notes, or just paste the whole post).

 UDFs for dynamic invocation of simple Java methods
 --

 Key: PIG-1354
 URL: https://issues.apache.org/jira/browse/PIG-1354
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1354.patch, PIG-1354.patch, PIG-1354.patch


 The need to create wrapper UDFs for simple Java functions creates unnecessary 
 work for Pig users, slows down the development process, and produces a lot of 
 trivial classes. We can use Java's reflection to allow invoking a number of 
 methods on the fly, dynamically, by creating a generic UDF to accomplish this.




[jira] Commented: (PIG-1555) [piggybank] add CSV Loader

2010-08-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901697#action_12901697
 ] 

Dmitriy V. Ryaboy commented on PIG-1555:


Alan,
The differences I observe when running on actual csv files are within the 
margin of error -- sometimes CSVLoader comes out on top. Then again I am 
reading actual CSVs with quoted commas, so it's possible that the similarity in 
runtimes is due to the fact that PigStorage sees the commas and allocates extra 
tuple fields.

-D

 [piggybank] add CSV Loader
 --

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1555.patch


 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
 done.




[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Attachment: PIG_1205_7.patch

Implemented LoadPushDown (NOTE: this involved a slight backwards-compatible 
refactoring of Utf8StorageConverter).
Refactored the tests a bit.

At this point I think we are good except for further testing and documentation.

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch







[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Status: Patch Available  (was: Open)

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch, PIG_1205_7.patch







[jira] Created: (PIG-1555) [piggybank] add CSV Loader

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)
[piggybank] add CSV Loader
--

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0


Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
done.




[jira] Updated: (PIG-1555) [piggybank] add CSV Loader

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1555:
---

Attachment: PIG_1555.patch

This is loosely based on the loader by James Kebinger that he open-sourced at 
http://github.com/jkebinger/pig-user-defined-functions 

I ported to the new API and fixed a few bugs.

Still doesn't support multi-line records, but the basic stuff works, including 
escaping quotes by doubling them, Excel-style.

 [piggybank] add CSV Loader
 --

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1555.patch


 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
 done.




[jira] Updated: (PIG-1555) [piggybank] add CSV Loader

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1555:
---

Status: Patch Available  (was: Open)

 [piggybank] add CSV Loader
 --

 Key: PIG-1555
 URL: https://issues.apache.org/jira/browse/PIG-1555
 Project: Pig
  Issue Type: New Feature
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG_1555.patch


 Users often ask for a CSV loader that can handle quoted commas. Let's get 'er 
 done.




[jira] Commented: (PIG-1508) Make 'docs' target (forrest) work with Java 1.6

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901158#action_12901158
 ] 

Dmitriy V. Ryaboy commented on PIG-1508:


In http://comments.gmane.org/gmane.text.xml.forrest.user/4899 a Forrest 
committer says: "This validate sitemap task doesn't really do much anyway.
Its main purpose is to demonstrate the power of using Jing to do xml validation 
during the build phase. There are other better demonstrations of that."

Sounds like this is safe to do. 
+1


 Make 'docs' target (forrest) work with Java 1.6
 ---

 Key: PIG-1508
 URL: https://issues.apache.org/jira/browse/PIG-1508
 Project: Pig
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.7.0
Reporter: Carl Steinbach
Assignee: Carl Steinbach
 Attachments: PIG-1508.patch.txt


 FOR-984 covers the very inconvenient fact that Forrest 0.8 does not work with 
 Java 1.6
 The same ticket also suggests a workaround: disabling sitemap and stylesheet 
 validation
 by setting the forrest.validate.sitemap and forrest.validate.stylesheets 
 properties to false.




[jira] Commented: (PIG-1237) Piggybank MutliStorage - specify field to write in output

2010-08-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901161#action_12901161
 ] 

Dmitriy V. Ryaboy commented on PIG-1237:


Gerrit,
Sorry this fell through the cracks! Just noticed this ticket.

The ability to specify just one column seems very limited. Perhaps instead one 
could optionally specify whether to materialize the splitField? I think this 
would accomplish the same thing in a more general manner.

Also perhaps this warrants a second constructor, as introducing new arguments 
to the existing one will break backwards compatibility.

 Piggybank MutliStorage - specify field to write in output
 -

 Key: PIG-1237
 URL: https://issues.apache.org/jira/browse/PIG-1237
 Project: Pig
  Issue Type: Improvement
Reporter: Gerrit Jansen van Vuuren
Assignee: Gerrit Jansen van Vuuren
Priority: Minor
 Attachments: PIG-1237.patch


 I've made a modification to the piggybank MultiStorage class that makes it 
 possible to optionally specify the index of the field in each tuple to write to 
 output. This feature makes it possible to have records with metadata such as 
 seqno, time of upload, etc., and then to combine files from these records into 
 one, but without the metadata.
 e.g. 
 1: date type seq1 data
 2:  date type seq2 data
 then write output grouped by type and ordered by sequence:
 data
 data




[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-21 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Status: Open  (was: Patch Available)

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path







[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-21 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Status: Patch Available  (was: Open)

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch







[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-21 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Attachment: PIG_1205_6.patch

Fixed test (but did not add new tests).
Made default caster configurable by setting pig.hbase.caster property. 
Made rowKey filters (gt, lt, gte, lte) filter out regions when possible. Tested 
manually.

Jeff, to your comments about shifting to cut off regions --  I think it's 
better to have the loader think about region sizes, and let the user only worry 
about key values. If they are intimate enough with their tables to know region 
boundaries, they should know which end of a region is inclusive and which is 
exclusive, and provide the correct filters.
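
To illustrate what "filter out regions when possible" could mean, here is a rough sketch of the pruning check, assuming the usual HBase convention that a region covers keys from its start key (inclusive) up to its end key (exclusive), with an empty key meaning unbounded. The names are hypothetical and this is not the code in the patch; real row keys are byte arrays, and Strings are used here only to keep the example short.

```java
// Hypothetical sketch; HBase compares raw byte[] keys, Strings are used here for brevity.
final class RegionPruneSketch {
    /**
     * Returns true if a region covering [regionStart, regionEnd) may contain a row key
     * satisfying the given bounds. An empty regionStart or regionEnd means "unbounded";
     * a null bound means that filter was not supplied. The check is conservative:
     * it only returns false when no key in the region can match.
     */
    static boolean mayContain(String regionStart, String regionEnd,
                              String gt, String gte, String lt, String lte) {
        boolean hasEnd = !regionEnd.isEmpty();
        // Lower bounds: every key in the region is strictly less than regionEnd.
        if (hasEnd && gte != null && regionEnd.compareTo(gte) <= 0) return false;
        if (hasEnd && gt != null && regionEnd.compareTo(gt) <= 0) return false;
        // Upper bounds: the smallest possible key in the region is regionStart.
        if (lt != null && regionStart.compareTo(lt) >= 0) return false;
        if (lte != null && regionStart.compareTo(lte) > 0) return false;
        return true;
    }
}
```

With a region covering [m, t), a filter of gte 't' or lt 'm' lets the whole region be skipped, while lte 'm' still has to scan it because the start key is inclusive.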

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path, PIG_1205_6.patch







[jira] Assigned: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-20 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy reassigned PIG-1551:
--

Assignee: Dmitriy V. Ryaboy

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 
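
 A minimal sketch of the idea, using plain Java reflection rather than Pig's 
 actual invoker classes. The class and method names below are hypothetical, a 
 java.util.List stands in for a Pig bag, and the local mean() method stands in 
 for a library function that takes a double[].

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; not the code from PIG-1551.patch.
final class InvokerSketch {

    // Stand-in for a library method that takes an array argument.
    public static double mean(double[] values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    // Wraps a static method that takes double[]; the "bag" is a List here.
    static double invokeOnBag(Class<?> owner, String methodName, List<Double> bag) {
        double[] arr = new double[bag.size()];
        for (int i = 0; i < arr.length; i++) {
            arr[i] = bag.get(i);
        }
        try {
            Method m = owner.getMethod(methodName, double[].class);
            // Cast to Object so the array is passed as a single argument.
            return (Double) m.invoke(null, (Object) arr);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + methodName, e);
        }
    }

    // Wraps a static method that takes no arguments.
    static Object invokeNoArg(Class<?> owner, String methodName) {
        try {
            return owner.getMethod(methodName).invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(invokeOnBag(InvokerSketch.class, "mean",
                Arrays.asList(1.0, 2.0, 3.0)));
        System.out.println(invokeNoArg(System.class, "lineSeparator").equals(
                System.lineSeparator()));
    }
}
```

 The reflective lookup and the copy from bag to array on every call are 
 where the "sacrificing some speed" part comes from.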

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-19 Thread Dmitriy V. Ryaboy (JIRA)
Improve dynamic invokers to deal with no-arg methods and array parameters
-

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Reporter: Dmitriy V. Ryaboy


PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
Java methods in a UDF, so that users don't need to create trivial wrappers if 
they are ok sacrificing some speed.

This issue is to extend the set of methods that can be wrapped this way to 
include methods that do not take any arguments, and methods that take arrays of 
{int,long,float,double,string} as arguments. 
Arrays are expected to be represented by bags in Pig. Notably, this allows 
users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-19 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

Attachment: PIG-1551.patch

Patch attached.

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1551) Improve dynamic invokers to deal with no-arg methods and array parameters

2010-08-19 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1551:
---

   Status: Patch Available  (was: Open)
Affects Version/s: 0.8.0
Fix Version/s: 0.8.0

 Improve dynamic invokers to deal with no-arg methods and array parameters
 -

 Key: PIG-1551
 URL: https://issues.apache.org/jira/browse/PIG-1551
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1551.patch


 PIG-1354 introduced a set of UDFs that can be used to dynamically wrap simple 
 Java methods in a UDF, so that users don't need to create trivial wrappers if 
 they are ok sacrificing some speed.
 This issue is to extend the set of methods that can be wrapped this way to 
 include methods that do not take any arguments, and methods that take arrays 
 of {int,long,float,double,string} as arguments. 
 Arrays are expected to be represented by bags in Pig. Notably, this allows 
 users to wrap statistical functions in o.a.commons.math.stat.StatUtils . 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1420:
---

Attachment: PIG-1420.2.patch

This should fix the problem :). LMK if you'd like me to commit this.

 Make CONCAT act on all fields of a tuple, instead of just the first two 
 fields of a tuple
 -

 Key: PIG-1420
 URL: https://issues.apache.org/jira/browse/PIG-1420
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.8.0

 Attachments: addconcat2.patch, PIG-1420.2.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
 org.apache.pig.builtin.StringConcat (which acts on Strings internally), both 
 act on the first two fields of a tuple.  This results in ugly nested CONCAT 
 calls like:
 CONCAT(CONCAT(A, ' '), B)
 The more desirable form is:
 CONCAT(A, ' ', B)
 This change will be backwards compatible, provided that no one was relying on 
 the fact that CONCAT ignores fields after the first two in a tuple.  This 
 seems a reasonable assumption to make, or at least a small break in 
 compatibility for a sizable improvement.
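
 A minimal sketch of the proposed behavior, with a java.util.List standing in 
 for the input tuple. This is not the code from either attached patch; the 
 class name is hypothetical, and returning null when any field is null is an 
 assumption made here for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only; a List<String> stands in for a Pig tuple.
final class ConcatSketch {

    // Concatenates every field, not just the first two.
    static String concatAll(List<String> fields) {
        if (fields == null || fields.size() < 2) {
            throw new IllegalArgumentException("need at least two fields");
        }
        StringBuilder sb = new StringBuilder();
        for (String field : fields) {
            if (field == null) {
                return null; // assumed null handling, see lead-in
            }
            sb.append(field);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Equivalent of CONCAT(A, ' ', B) without nesting.
        System.out.println(concatAll(Arrays.asList("A", " ", "B")));
    }
}
```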

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899587#action_12899587
 ] 

Dmitriy V. Ryaboy commented on PIG-1420:


Right.. I forgot people don't call StringConcat directly.
I don't know how one specifies a vararg schema. Hints?

 Make CONCAT act on all fields of a tuple, instead of just the first two 
 fields of a tuple
 -

 Key: PIG-1420
 URL: https://issues.apache.org/jira/browse/PIG-1420
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.8.0

 Attachments: addconcat2.patch, PIG-1420.2.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
 org.apache.pig.builtin.StringConcat (which acts on Strings internally), both 
 act on the first two fields of a tuple.  This results in ugly nested CONCAT 
 calls like:
 CONCAT(CONCAT(A, ' '), B)
 The more desirable form is:
 CONCAT(A, ' ', B)
 This change will be backwards compatible, provided that no one was relying on 
 the fact that CONCAT ignores fields after the first two in a tuple.  This 
 seems a reasonable assumption to make, or at least a small break in 
 compatibility for a sizable improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1420) Make CONCAT act on all fields of a tuple, instead of just the first two fields of a tuple

2010-08-17 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899597#action_12899597
 ] 

Dmitriy V. Ryaboy commented on PIG-1420:


Yeah, let's plan to add a way to specify a vararg in the schema in 0.9.

In the meantime, what do we do with concat? Option 1: leave broken (only works 
for 2 arguments). Option 2: take out arg2func mapping, and have people who want 
to concat strings use StringConcat explicitly.

Actually, there is an option 3, which makes more sense than option 2: make 
CONCAT actually do what StringConcat does, and introduce BinConcat (since it 
seems unlikely people are actually concatting bytearrays...).

 Make CONCAT act on all fields of a tuple, instead of just the first two 
 fields of a tuple
 -

 Key: PIG-1420
 URL: https://issues.apache.org/jira/browse/PIG-1420
 Project: Pig
  Issue Type: Improvement
  Components: impl
Affects Versions: 0.8.0
Reporter: Russell Jurney
Assignee: Russell Jurney
 Fix For: 0.8.0

 Attachments: addconcat2.patch, PIG-1420.2.patch

   Original Estimate: 24h
  Remaining Estimate: 24h

 org.apache.pig.builtin.CONCAT (which acts on DataByteArrays internally) and 
 org.apache.pig.builtin.StringConcat (which acts on Strings internally), both 
 act on the first two fields of a tuple.  This results in ugly nested CONCAT 
 calls like:
 CONCAT(CONCAT(A, ' '), B)
 The more desirable form is:
 CONCAT(A, ' ', B)
 This change will be backwards compatible, provided that no one was relying on 
 the fact that CONCAT ignores fields after the first two in a tuple.  This 
 seems a reasonable assumption to make, or at least a small break in 
 compatibility for a sizable improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-17 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12899722#action_12899722
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


bq. 1. Is it possible to specify min_row_key and max_row_key in parameters

Even better than that -- you can specify lt, lte, gt, and gte. It's true that 
as written splits will be created for the whole table, but the filters will 
cause most of those splits to immediately exit. Not creating the splits is on 
my todo list (I already do this in the elephantbird version for 0.6)

bq. 2. One small suggestion: move line 206 to if block (only one time setting 
is enough)

Good idea.

bq. 3. It's better to add a warning log in HBaseBinaryConverter when the bytes 
are cut off for type conversion 

Will do. 

bq. 4. The parameter Per-region limit is a bit confusing for me, I think 
users would like to set the limit on the whole table, not per region. What 
do you think?

Trouble is, you can't enforce a total limit without post-processing. In 
practice, I use -limit when I am experimenting and want to get just a few rows 
from HBase; if I want a specific number of rows, I use both -limit (to speed up 
the tasks, since the scanners will exit early), and Pig's LIMIT operator (to 
get the exact number of rows I need).



 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-08-16 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1205:
---

Attachment: PIG_1205_5.path

This patch (not really review-ready yet) introduces the Elephant-Bird 
improvements.

You can use -gt, -gte, -lt, -lte flags to filter out row ranges, specify 
caching and per-region row limits, and you can specify the caster to use 
(interpret Strings, as before, or use bytes directly for more efficient storage 
and communication).

The filtering is a bit off because it still spins up all the map tasks; the 
ones whose keys are filtered out just finish extremely fast. 

The progress reporting is a bit jittery, but better than nothing.

TODO: fix up filtering, add projection pushdown, add filter pushdown, and write 
better tests.



 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch, PIG_1205_5.path




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1516) finalize in bag implementations causes pig to run out of memory in reduce

2010-07-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892375#action_12892375
 ] 

Dmitriy V. Ryaboy commented on PIG-1516:


Another workaround for the meantime:

One can introduce a SmallBagFactory that inherits from BagFactory and produces 
SmallBags, which implement DataBag without a finalize method and without the 
file-spilling behavior.  SmallBagFactory would return SmallBags when 
bagFactory.newDefaultBag() is called. Then set the system properties 
pig.data.bag.factory.name and pig.data.bag.factory.jar in pig.properties to 
point to the new classes. 

Naturally, one has to be certain that databags won't need to spill to disk when 
doing this...


Ankur -- so what are you suggesting as a fix that avoids finalize? 

 finalize in bag implementations causes pig to run out of memory in reduce 
 --

 Key: PIG-1516
 URL: https://issues.apache.org/jira/browse/PIG-1516
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 *Problem:*
 pig bag implementations that are subclasses of DefaultAbstractBag, have 
 finalize methods implemented. As a result, the garbage collector moves them 
 to a finalization queue, and the memory used is freed only after the 
 finalization happens on it.
 If the bags are not finalized fast enough, a lot of memory is consumed by the 
 finalization queue, and pig runs out of memory. This can happen if large 
 number of small bags are being created.
 *Solution:*
 The finalize function exists for the purpose of deleting the spill files that 
 are created when the bag is too large. But if the bags are small enough, no 
 spill files are created, and there is no use of the finalize function.
  A new class that holds a list of files will be introduced (FileList). This 
 class will have a finalize method that deletes the files. The bags will no 
 longer have finalize methods, and the bags will use FileList instead of 
 ArrayList<File>.
 *Possible workaround for earlier releases:*
 Since the fix is going into 0.8, here is a workaround -
 Disabling the combiner will reduce the number of bags getting created, as 
 there will not be the stage of combining intermediate merge results. But I 
 would recommend disabling it only if you have this problem as it is likely to 
 slow down the query .
 To disable combiner, set the property: -Dpig.exec.nocombiner=true
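
 A rough sketch of the shape of the proposed fix, not the actual Pig change. 
 All names are hypothetical. The bag allocates a file-list object only when 
 it spills, so small bags carry no cleanup object at all. For simplicity the 
 sketch deletes files through an explicit deleteAll() call; the issue proposes 
 doing the equivalent from a finalize method on FileList.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; class names are hypothetical.
final class SpillFilesSketch {

    // Owns the spill files, so only this object needs cleanup logic.
    static final class FileList {
        private final List<File> files = new ArrayList<>();

        void add(File f) {
            files.add(f);
        }

        // The issue would call the equivalent of this from finalize().
        void deleteAll() {
            for (File f : files) {
                f.delete();
            }
            files.clear();
        }
    }

    // A bag with no finalize method of its own.
    static final class Bag {
        private final List<String> contents = new ArrayList<>();
        private FileList spills; // stays null unless the bag spills

        void add(String value) {
            contents.add(value);
        }

        boolean hasSpillFiles() {
            return spills != null;
        }

        // Creates an empty temp file to stand in for a real spill.
        File spill() {
            try {
                if (spills == null) {
                    spills = new FileList();
                }
                File f = File.createTempFile("bag", ".spill");
                spills.add(f);
                contents.clear();
                return f;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        void clear() {
            contents.clear();
            if (spills != null) {
                spills.deleteAll();
            }
        }
    }

    public static void main(String[] args) {
        Bag small = new Bag();
        small.add("x");
        System.out.println("small bag has spill files: " + small.hasSpillFiles());
    }
}
```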

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-07-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12892376#action_12892376
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


I can integrate my changes by then. 

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1150) VAR() Variance UDF

2010-07-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891863#action_12891863
 ] 

Dmitriy V. Ryaboy commented on PIG-1150:


Meh. Go ahead and commit. Don't put it into builtin, since it has math problems 
at scale. Ok for piggybank.

 VAR() Variance UDF
 --

 Key: PIG-1150
 URL: https://issues.apache.org/jira/browse/PIG-1150
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.5.0
 Environment: UDF, written in Pig 0.5 contrib/
Reporter: Russell Jurney
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: var.patch


 I've implemented a UDF in Pig 0.5 that implements Algebraic and calculates 
 variance in a distributed manner, based on the AVG() builtin.  It works by 
 calculating the count, sum and sum of squares, as described here: 
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
 Is this a worthwhile contribution?  Taking the square root of this value 
 using the contrib SQRT() function gives Standard Deviation, which is missing 
 from Pig.
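
 A minimal sketch of the count / sum / sum-of-squares approach described 
 above. It is not the code from var.patch, and the names are hypothetical. 
 Each partial aggregate can be computed on a separate chunk and merged, which 
 is what makes the computation fit an Algebraic UDF. The final step computes 
 sumSq/n - (sum/n)^2, which subtracts two potentially large and nearly equal 
 numbers; that loss of precision is one plausible reading of the "math 
 problems at scale" remark in the comment, though the comment does not say so.

```java
// Illustrative sketch only; not the code from var.patch.
final class VarianceSketch {

    // Partial aggregate: count, sum, and sum of squares of one chunk.
    static final class Partial {
        final long count;
        final double sum;
        final double sumSq;

        Partial(long count, double sum, double sumSq) {
            this.count = count;
            this.sum = sum;
            this.sumSq = sumSq;
        }

        static Partial of(double... values) {
            double s = 0;
            double sq = 0;
            for (double v : values) {
                s += v;
                sq += v * v;
            }
            return new Partial(values.length, s, sq);
        }

        // Merging partials is plain addition, so it can run in a combiner.
        Partial merge(Partial other) {
            return new Partial(count + other.count, sum + other.sum,
                    sumSq + other.sumSq);
        }

        // Population variance: E[x^2] - (E[x])^2.
        double variance() {
            double mean = sum / count;
            return sumSq / count - mean * mean;
        }
    }

    public static void main(String[] args) {
        Partial a = Partial.of(2.0, 4.0, 4.0, 4.0);
        Partial b = Partial.of(5.0, 5.0, 7.0, 9.0);
        Partial all = a.merge(b);
        System.out.println("variance = " + all.variance());
        System.out.println("std dev  = " + Math.sqrt(all.variance()));
    }
}
```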

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1205) Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc

2010-07-23 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12891864#action_12891864
 ] 

Dmitriy V. Ryaboy commented on PIG-1205:


When is the cut-off date for that?

 Enhance HBaseStorage-- Make it support loading row key and implement StoreFunc
 --

 Key: PIG-1205
 URL: https://issues.apache.org/jira/browse/PIG-1205
 Project: Pig
  Issue Type: Sub-task
Affects Versions: 0.7.0
Reporter: Jeff Zhang
Assignee: Jeff Zhang
 Fix For: 0.8.0

 Attachments: PIG_1205.patch, PIG_1205_2.patch, PIG_1205_3.patch, 
 PIG_1205_4.patch




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1500) guava.jar should be removed from the lib folder

2010-07-21 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890603#action_12890603
 ] 

Dmitriy V. Ryaboy commented on PIG-1500:


Have you tried actually building with this? The reason I put guava r3 into lib 
was that the public maven deploy for it is broken.

Here's what happens when I apply this patch and try to build:

{code}
[ivy:resolve]  WARNINGS
[ivy:resolve]   problem while downloading module descriptor: 
http://repo1.maven.org/maven2/com/google/guava/guava/r03/guava-r03.pom: invalid 
sha1: expected=1cbd6fab2460050ff7147b6d8536f39c8f535067 
computed=7a37041386ee39a1fbb3efd3c4c6932809cb5887 (1304ms)
{code}

Now, we can probably still get away with removing guava from lib/ -- they just 
released guava-r6, which should be compatible with the guava-dependent code in 
Pig, and is supposed to have a proper maven deploy. But the patch as is should 
not be applied.

 guava.jar should be removed from the lib folder
 ---

 Key: PIG-1500
 URL: https://issues.apache.org/jira/browse/PIG-1500
 Project: Pig
  Issue Type: Bug
  Components: build
Reporter: Giridharan Kesavan
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: removeGuavaJar.patch


 The guava jar is available in the maven repository, but it is still checked 
 into the pig trunk's lib folder.
 I've checked the availability of the guava jar in the maven repository.
 http://mvnrepository.com/artifact/com.google.guava/guava

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1478) Add progress notification listener to PigRunner API

2010-07-19 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12890090#action_12890090
 ] 

Dmitriy V. Ryaboy commented on PIG-1478:


This seems to fit the bill.

 Add progress notification listener to PigRunner API
 ---

 Key: PIG-1478
 URL: https://issues.apache.org/jira/browse/PIG-1478
 Project: Pig
  Issue Type: Improvement
Reporter: Richard Ding
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1478.patch


 PIG-1333 added PigRunner API to allow Pig users and tools to get a 
 status/stats object back after executing a Pig script. The new API, however, 
 is synchronous (blocking). It's known that a Pig script can spawn tens (even 
 hundreds) of MR jobs and take hours to complete. Therefore it'll be nice to give 
 progress feedback to the callers during the execution.
 The proposal is to add an optional parameter to the API:
 {code}
 public abstract class PigRunner {
 public static PigStats run(String[] args, PigProgressNotificationListener 
 listener) {...}
 }
 {code} 
 The new listener is defined as follows:
 {code}
 package org.apache.pig.tools.pigstats;
 public interface PigProgressNotificationListener extends 
 java.util.EventListener {
 // just before the launch of MR jobs for the script
 public void LaunchStartedNotification(int numJobsToLaunch);
 // number of jobs submitted in a batch
 public void jobsSubmittedNotification(int numJobsSubmitted);
 // a job is started
 public void jobStartedNotification(String assignedJobId);
 // a job is completed successfully
 public void jobFinishedNotification(JobStats jobStats);
 // a job has failed
 public void jobFailedNotification(JobStats jobStats);
 // a user output is completed successfully
 public void outputCompletedNotification(OutputStats outputStats);
 // updates the progress as percentage
 public void progressUpdatedNotification(int progress);
 // the script execution is done
 public void launchCompletedNotification(int numJobsSucceeded);
 }
 {code}
 Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1473) Avoid serialization/deserialization costs for PigStorage data - Use custom Map and Bag implementation

2010-07-15 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888942#action_12888942
 ] 

Dmitriy V. Ryaboy commented on PIG-1473:


Thejas, do you think there could be any performance gains if we could delay 
deserialization of the top-level fields in the tuple, but deserialize whole 
maps or databags if they are touched?

 Avoid serialization/deserialization costs for PigStorage data - Use custom 
 Map and Bag implementation
 -

 Key: PIG-1473
 URL: https://issues.apache.org/jira/browse/PIG-1473
 Project: Pig
  Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 Cost of serialization/deserialization (sedes) can be very high and avoiding 
 it will improve performance.
 Avoid sedes when possible by implementing approach #3 proposed in 
 http://wiki.apache.org/pig/AvoidingSedes .
 The load function uses subclass of Map and DataBag which holds the serialized 
 copy.  LoadFunction delays deserialization of map and bag types until a 
 member function of java.util.Map or DataBag is called. 
 Example of query where this will help -
 {CODE}
 l = LOAD 'file1' AS (a : int, b : map [ ]);
 f = FOREACH l GENERATE udf1(a), b;  
 fil = FILTER f BY $0 > 5;
 dump fil; -- Serialization of column b can be delayed until here using this 
 approach .
 {CODE}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-07-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy closed PIG-1428.
--


 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-07-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888545#action_12888545
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


I ran the new and changed tests manually before committing, but not the whole 
set (didn't have 12 hours to spare). Which tests are failing for you?

 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-07-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888546#action_12888546
 ] 

Dmitriy V. Ryaboy commented on PIG-1434:


+1 for casting as tuple.  Though it may have to look like

{code}
Y = foreach Z generate X::$1/(long) ((tuple)C).count, X::$2 - (long) 
((tuple)C).max;
{code}

Definitely -1 on the bracket syntax.. it seems very non-intuitive. 

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-07-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888588#action_12888588
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


Found the culprit, will commit fix within ~ 20 mins assuming tests pass.


 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-07-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12888599#action_12888599
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


Yeah, that's the patch I have, verbatim. Sorry about breaking the build again. 

 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: npe.patch, PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Closed: (PIG-1449) RegExLoader hangs on lines that don't match the regular expression

2010-07-02 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy closed PIG-1449.
--


 RegExLoader hangs on lines that don't match the regular expression
 --

 Key: PIG-1449
 URL: https://issues.apache.org/jira/browse/PIG-1449
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Justin Sanders
Priority: Minor
 Fix For: 0.8.0

 Attachments: PIG-1449-RegExLoaderInfiniteLoopFix.patch, 
 RegExLoader.patch


 In the 0.7.0 changes to RegExLoader there was a bug introduced where the code 
 will stay in the while loop if the line isn't matched.  Before 0.7.0 these 
 lines would be skipped if they didn't match the regular expression.  The 
 result is the mapper will not respond and will time out with "Task attempt_X 
 failed to report status for 600 seconds. Killing!".
 Here are the steps to recreate the bug:
 Create a text file in HDFS with the following lines:
 test1
 testA
 test2
 Run the following pig script:
 REGISTER /usr/local/pig/contrib/piggybank/java/piggybank.jar;
 test = LOAD '/path/to/test.txt' using 
 org.apache.pig.piggybank.storage.MyRegExLoader('(test\\d)') AS (line);
 dump test;
 Expected result:
 (test1)
 (test2)
 Actual result:
 Job fails to complete after 600 second timeout waiting on the mapper to 
 complete.  The mapper hangs at 33% since it can process the first line but 
 gets stuck in the while loop on the second line.
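
A minimal Java sketch of the pre-0.7.0 behavior described above (skip lines that do not match), for illustration only. The names RegexLineReaderSketch, next and readAll are hypothetical, and this is not the piggybank RegExLoader source. It only shows a read loop that fetches a fresh line on every iteration, so a non-matching line is skipped instead of being retried forever.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a regex-based loader (not the piggybank code).
final class RegexLineReaderSketch {
    private final Pattern pattern;
    private final Iterator<String> lines;

    RegexLineReaderSketch(String regex, Iterator<String> lines) {
        this.pattern = Pattern.compile(regex);
        this.lines = lines;
    }

    // Returns group 1 of the next matching line, or null once input is exhausted.
    String next() {
        while (lines.hasNext()) {
            // Advance the input on every pass so the loop always makes progress.
            String line = lines.next();
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                return m.group(1);
            }
            // No match: fall through and read the next line.
        }
        return null;
    }

    // Convenience helper: collect every matching value from the input.
    static List<String> readAll(String regex, List<String> input) {
        RegexLineReaderSketch reader = new RegexLineReaderSketch(regex, input.iterator());
        List<String> out = new ArrayList<String>();
        for (String v = reader.next(); v != null; v = reader.next()) {
            out.add(v);
        }
        return out;
    }
}
```

Called with the regex (test\d) on the three-line file from the steps above, readAll returns test1 and test2 and skips testA.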




[jira] Updated: (PIG-1469) DefaultDataBag assumes ArrayList as default List type

2010-07-02 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1469:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed

I committed this.

 DefaultDataBag assumes ArrayList as default List type
 -

 Key: PIG-1469
 URL: https://issues.apache.org/jira/browse/PIG-1469
 Project: Pig
  Issue Type: Bug
  Components: data
Affects Versions: 0.8.0
Reporter: Gianmarco De Francisci Morales
Assignee: Gianmarco De Francisci Morales
 Fix For: 0.8.0

 Attachments: PIG-1469.patch


 In org.apache.pig.data.DefaultDataBag, the field mContents is assumed to be 
 of type ArrayList but the user can actually pass a different List to the 
 constructor.
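
To illustrate the failure mode, here is a small Java sketch. BagSketch and its methods are hypothetical and are not the DefaultDataBag code; the contents field only plays the role of mContents. Casting the field to ArrayList fails as soon as the caller passes any other List, while code written against the List interface keeps working.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bag-like holder; "contents" stands in for mContents.
final class BagSketch {
    private final List<String> contents;

    // The constructor accepts any List, as the issue describes.
    BagSketch(List<String> contents) {
        this.contents = contents;
    }

    // Fragile: throws ClassCastException unless the caller passed an ArrayList.
    int sizeAssumingArrayList() {
        return ((ArrayList<String>) contents).size();
    }

    // Robust: relies only on the List interface.
    int size() {
        return contents.size();
    }
}
```

With a LinkedList passed to the constructor, size() works and sizeAssumingArrayList() throws ClassCastException.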




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884763#action_12884763
 ] 

Dmitriy V. Ryaboy commented on PIG-928:
---

Aniket, the patch does not apply cleanly to trunk, can you rebase it? 

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, pig-greek.tgz, 
 pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.




[jira] Updated: (PIG-928) UDFs in scripting languages

2010-07-02 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-928:
--

Attachment: PIG-928.patch

I rebased the patch and made it pull jython down via maven. 2.5.1 doesn't 
appear to be available right now, so this pulls down 2.5.0. Hope that's ok.

Looks like the tabulation is wrong in most of this patch.. someone please hit 
ctrl-a, ctrl-i next time :).

Needless to say, this thing needs tests, desperately.

Also imho in order for it to make it into trunk, it should be a compile-time 
option to support (and pull down) jython or jruby or whatnot, not a default 
option. Otherwise we are well on our way to making people pull down the 
internet in order to compile pig.

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterScriptUDFDefineParse.patch, scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.




[jira] Commented: (PIG-928) UDFs in scripting languages

2010-07-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884845#action_12884845
 ] 

Dmitriy V. Ryaboy commented on PIG-928:
---

Aniket, I already made the changes you need to pull down jython -- take a look 
at the patch I attached.

One more general note -- let's say jython instead of python (in the grammar, 
the keywords, everywhere), as there may be slight incompatibilities between the 
two and we want to be clear on what we are using.

 UDFs in scripting languages
 ---

 Key: PIG-928
 URL: https://issues.apache.org/jira/browse/PIG-928
 Project: Pig
  Issue Type: New Feature
Reporter: Alan Gates
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: calltrace.png, package.zip, PIG-928.patch, 
 pig-greek.tgz, pig.scripting.patch.arnab, pyg.tgz, RegisterPythonUDF2.patch, 
 RegisterPythonUDF3.patch, RegisterPythonUDF4.patch, 
 RegisterPythonUDFFinale.patch, RegisterScriptUDFDefineParse.patch, 
 scripting.tgz, scripting.tgz, test.zip


 It should be possible to write UDFs in scripting languages such as python, 
 ruby, etc.  This frees users from needing to compile Java, generate a jar, 
 etc.  It also opens Pig to programmers who prefer scripting languages over 
 Java.




[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-07-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884369#action_12884369
 ] 

Dmitriy V. Ryaboy commented on PIG-1434:


A couple of thoughts that came out of the Pig contributor meeting:

1) rather than scalar, we should make this work for single-tuple relations. 
That way a user can do something like this: 

{code}
A = load 'data' as (x, y, z);
B = group A all;
C = foreach B generate COUNT(A) as count, MAX(A.y) as max;
.
X = 
Y = foreach X generate $1/(long) C.count, $2-(long) C.max;
{code}

2) Writing the intermediate relation to a file can cause hotspots. We should 
push this into the distributed cache. In cases when the dist. cache is turned 
off, we can at least increase the replication factor to some large-ish number 
(10, maybe, like the jobs?)

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF




[jira] Commented: (PIG-1434) Allow casting relations to scalars

2010-07-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12884502#action_12884502
 ] 

Dmitriy V. Ryaboy commented on PIG-1434:


SQL only fails at runtime when executing queries that require a single row to be 
returned. So, Oracle won't complain if you do this, for example:

{code}

SELECT foo.a, (SELECT c 
   FROM bar 
   WHERE foo.a = bar.a) 
from foo

{code}

unless the inner select produces more than one row. I think we should adopt the 
same approach -- assume the query is innocent until proven guilty.
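
In Java terms, the runtime check could look roughly like the sketch below. ScalarSketch.single is a hypothetical helper, not Pig code. Returning null for an empty input is an assumption borrowed from how SQL treats a scalar subquery that returns no rows. The point is only that nothing fails until a second row actually shows up.

```java
import java.util.Iterator;

// Hypothetical helper: turn a relation-like sequence into a single value.
final class ScalarSketch {
    static <T> T single(Iterable<T> rows) {
        Iterator<T> it = rows.iterator();
        if (!it.hasNext()) {
            // Assumption: no rows maps to null, like an empty SQL scalar subquery.
            return null;
        }
        T value = it.next();
        if (it.hasNext()) {
            // Only now is the query "proven guilty".
            throw new IllegalStateException("Scalar source produced more than one row");
        }
        return value;
    }
}
```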

-D

 Allow casting relations to scalars
 --

 Key: PIG-1434
 URL: https://issues.apache.org/jira/browse/PIG-1434
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Aniket Mokashi
 Fix For: 0.8.0

 Attachments: scalarImpl.patch


 This jira is to implement a simplified version of the functionality described 
 in https://issues.apache.org/jira/browse/PIG-801.
 The proposal is to allow casting relations to scalar types in foreach.
 Example:
 A = load 'data' as (x, y, z);
 B = group A all;
 C = foreach B generate COUNT(A);
 .
 X = 
 Y = foreach X generate $1/(long) C;
 Couple of additional comments:
 (1) You can only cast relations including a single value or an error will be 
 reported
 (2) Name resolution is needed since relation X might have field named C in 
 which case that field takes precedence.
 (3) Y will look for C closest to it.
 Implementation thoughts:
 The idea is to store C into a file and then convert it into scalar via a UDF. 
 I believe we already have a UDF that Ben Reed contributed for this purpose. 
 Most of the work would be to update the logical plan to
 (1) Store C
 (2) convert the cast to the UDF




[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs

2010-06-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1427:
---

Status: Patch Available  (was: Open)

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff, PIG-1427.diff, PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.
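
A rough Java sketch of the general idea, using only java.util.concurrent. It makes no claim about how the attached patches work. TimeLimitedCall and callWithTimeout are hypothetical names. The call runs on a separate daemon thread, and if it does not finish within the time limit the caller gets a default value back instead of the task hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: run a call with a time limit and a fallback value.
final class TimeLimitedCall {
    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit, T defaultValue) {
        // Daemon thread, so a call that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "time-limited-call");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // cancel(true) only interrupts; code that ignores interrupts keeps running.
            future.cancel(true);
            return defaultValue;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return defaultValue;
        } catch (ExecutionException e) {
            // The task itself failed; surface its cause to the caller.
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

One caveat with this approach: interruption is cooperative, so a runaway regular expression may keep burning CPU on its thread after the caller has already moved on with the default value.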




[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs

2010-06-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1427:
---

Attachment: PIG-1427.diff

Final version of the patch.

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff, PIG-1427.diff, PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.




[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs

2010-06-22 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1427:
---

   Status: Resolved  (was: Patch Available)
Fix Version/s: 0.8.0
   Resolution: Fixed

Committed. 

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff, PIG-1427.diff, PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.




[jira] Commented: (PIG-1333) API interface to Pig

2010-06-22 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12881513#action_12881513
 ] 

Dmitriy V. Ryaboy commented on PIG-1333:


+1

 API interface to Pig
 

 Key: PIG-1333
 URL: https://issues.apache.org/jira/browse/PIG-1333
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1333.patch, PIG-1333_1.patch, PIG-1333_2.patch, 
 PIG-1333_3.patch


 It would be nice to make Pig more friendly for applications like workflow 
 that would be executing pig scripts on the user's behalf.
 Currently, they would have to use pig command line to execute the code; 
 however, this limits the kind of output that would be delivered. 
 For instance, it is hard to produce error information that is easy to use 
 programmatically or collect statistics.
 The proposal is to create a class that mimics the behavior of the Main but 
 gives users a status object back. The main code of pig would look 
 something like:
 public static void main(String args[])
 {
     PigStatus ps = PigMain.exec(args);
     System.exit(ps.rc);
 }
 We need to define the following:
 - Content of PigStatus. It should at least include
* return code
* error string
* exception 
* statistics
 - A way to propagate the status class through pig code
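
As a rough illustration of the status object listed above, here is a Java sketch. StatusSketch and its run method are hypothetical and do not reflect the class that any PIG-1333 patch defines. It carries the four items from the list (return code, error string, exception, statistics) and reports failure through the returned object instead of exiting the JVM.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical status holder covering the four items in the list above.
final class StatusSketch {
    final int returnCode;
    final String errorMessage;
    final Throwable exception;
    final Map<String, Long> statistics;

    private StatusSketch(int returnCode, String errorMessage,
                         Throwable exception, Map<String, Long> statistics) {
        this.returnCode = returnCode;
        this.errorMessage = errorMessage;
        this.exception = exception;
        this.statistics = statistics;
    }

    // Runs a job that returns its statistics; failures become a non-zero status.
    static StatusSketch run(Callable<Map<String, Long>> job) {
        try {
            return new StatusSketch(0, null, null, job.call());
        } catch (Exception e) {
            return new StatusSketch(1, e.getMessage(), e,
                    Collections.<String, Long>emptyMap());
        }
    }
}
```

A caller such as a workflow engine can then inspect returnCode and exception directly rather than parsing command-line output.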




[jira] Updated: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-06-17 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Resolved  (was: Patch Available)
Resolution: Fixed
  Tags: pig-0.7.1

Committed to trunk.
We may want to consider this for a 0.7.1, if such a thing comes about, as in a 
sense it's addressing a regression.

I tagged this issue with pig-0.7.1 so we can find it later if we decide a 
dot-release is warranted.

 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Make a StatusReporter singleton available for incrementing counters

2010-06-14 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

   Summary: Make a StatusReporter singleton available for incrementing 
counters  (was: Add getPigStatusReporter() to PigHadoopLogger)
Patch Info: [Patch Available]

 Make a StatusReporter singleton available for incrementing counters
 ---

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 




[jira] Commented: (PIG-1333) API interface to Pig

2010-06-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878732#action_12878732
 ] 

Dmitriy V. Ryaboy commented on PIG-1333:


bq. I'm not sure we should make all Hadoop counters available through the new 
API. How useful will it be to the users? I'm open to suggestions. 

Can't speak for other users, but we use counters quite a bit with Elephant Bird 
and some internal code for keeping track of timed out service requests, 
unparsable records, and more. The @MonitoredUDF annotation I proposed in 
PIG-1427 uses counters to report on runaway udfs that get killed.

I think the question isn't so much why would you expose them, as why wouldn't 
you expose them...

 API interface to Pig
 

 Key: PIG-1333
 URL: https://issues.apache.org/jira/browse/PIG-1333
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1333.patch, PIG-1333_1.patch


 It would be nice to make Pig more friendly for applications like workflow 
 that would be executing pig scripts on the user's behalf.
 Currently, they would have to use pig command line to execute the code; 
 however, this limits the kind of output that would be delivered. 
 For instance, it is hard to produce error information that is easy to use 
 programmatically or collect statistics.
 The proposal is to create a class that mimics the behavior of the Main but 
 gives users a status object back. The main code of pig would look 
 something like:
 public static void main(String args[])
 {
     PigStatus ps = PigMain.exec(args);
     System.exit(ps.rc);
 }
 We need to define the following:
 - Content of PigStatus. It should at least include
* return code
* error string
* exception 
* statistics
 - A way to propagate the status class through pig code




[jira] Commented: (PIG-1333) API interface to Pig

2010-06-14 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878826#action_12878826
 ] 

Dmitriy V. Ryaboy commented on PIG-1333:


Yup.

 API interface to Pig
 

 Key: PIG-1333
 URL: https://issues.apache.org/jira/browse/PIG-1333
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1333.patch, PIG-1333_1.patch


 It would be nice to make Pig more friendly for applications like workflow 
 that would be executing pig scripts on the user's behalf.
 Currently, they would have to use pig command line to execute the code; 
 however, this limits the kind of output that would be delivered. 
 For instance, it is hard to produce error information that is easy to use 
 programmatically or collect statistics.
 The proposal is to create a class that mimics the behavior of the Main but 
 gives users a status object back. The main code of pig would look 
 something like:
 public static void main(String args[])
 {
     PigStatus ps = PigMain.exec(args);
     System.exit(ps.rc);
 }
 We need to define the following:
 - Content of PigStatus. It should at least include
* return code
* error string
* exception 
* statistics
 - A way to propagate the status class through pig code




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-12 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Attachment: PIG-1428.patch

Once more, with feeling.
This implements Ashutosh's suggestion of making PigStatusReporter maintain a 
singleton and expose a public getInstance() method.
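
For readers who have not opened the patch, the shape of that suggestion is roughly the sketch below. StatusReporterSketch is a hypothetical stand-in and not the PigStatusReporter code; the counter methods are invented for illustration. The instance is created eagerly in a static final field, so getInstance() needs no locking.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical singleton reporter with named counters.
final class StatusReporterSketch {
    // Eager initialization: the JVM guarantees this runs once, thread-safely.
    private static final StatusReporterSketch INSTANCE = new StatusReporterSketch();

    private final ConcurrentMap<String, AtomicLong> counters =
            new ConcurrentHashMap<String, AtomicLong>();

    private StatusReporterSketch() {
    }

    static StatusReporterSketch getInstance() {
        return INSTANCE;
    }

    // Adds delta to the named counter and returns the new value.
    long increment(String name, long delta) {
        AtomicLong counter = counters.get(name);
        if (counter == null) {
            AtomicLong fresh = new AtomicLong();
            counter = counters.putIfAbsent(name, fresh);
            if (counter == null) {
                counter = fresh;
            }
        }
        return counter.addAndGet(delta);
    }

    // Current value of the named counter, or 0 if it was never incremented.
    long value(String name) {
        AtomicLong counter = counters.get(name);
        return counter == null ? 0L : counter.get();
    }
}
```

A UDF would then call something like StatusReporterSketch.getInstance().increment("unparsable_records", 1) from anywhere, without needing a reference handed to it.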

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs

2010-06-12 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1427:
---

Attachment: PIG-1427.diff

Slightly modified to match the patch in PIG-1428

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff, PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.




[jira] Commented: (PIG-1440) Refactor org.apache.pig.data.DataType to use Enums instead of integer constants

2010-06-06 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876011#action_12876011
 ] 

Dmitriy V. Ryaboy commented on PIG-1440:


I think Enums are great for this, and have wished many a time that the types 
were Enums while working with Pig.

I do want to point out, though, that this will affect a lot of user code -- any 
EvalFunc that specifies a schema, any loadfunc that implements the metadata 
options, etc. Are we willing to break things for our users so soon after 0.7?



 Refactor org.apache.pig.data.DataType to use Enums instead of integer 
 constants
 ---

 Key: PIG-1440
 URL: https://issues.apache.org/jira/browse/PIG-1440
 Project: Pig
  Issue Type: Improvement
Reporter: Gianmarco De Francisci Morales
Priority: Minor

 Refactoring DataType to use Enums instead of integer constants would provide 
 many benefits, including:
 * Cleaner code
 * Easier to iterate over Enums
 * Easier to add new Enums without breaking backwards compatibility
 * Can use EnumMaps to easily link values to Enums
 * Better support for translation from Enums to Strings and vice versa
 Int (or byte in Pig's case) Enum pattern has several drawbacks as summarized 
 here http://java.sun.com/j2se/1.5.0/docs/guide/language/enums.html
 Drawbacks:
 We have to explicitly convert Enum values to bytes when serializing. This can 
 be done in DataReaderWriter.
 Possibly higher overhead than simply using bytes.
 Refactoring might be difficult.
 Thoughts?
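
To make the serialization drawback concrete, here is a small Java sketch of an enum that carries an explicit byte code. TypeSketch, its constants and the byte values 1 to 3 are made up for illustration and are not the values in org.apache.pig.data.DataType. Writing a value means calling toByte(), and reading it back means a lookup through fromByte().

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical type enum with an explicit wire code per constant.
enum TypeSketch {
    INTEGER((byte) 1),
    LONG((byte) 2),
    CHARARRAY((byte) 3);

    // Reverse lookup table, built once when the enum class is initialized.
    private static final Map<Byte, TypeSketch> BY_CODE = new HashMap<Byte, TypeSketch>();

    static {
        for (TypeSketch t : values()) {
            BY_CODE.put(t.code, t);
        }
    }

    private final byte code;

    TypeSketch(byte code) {
        this.code = code;
    }

    // Byte written to the stream when serializing.
    byte toByte() {
        return code;
    }

    // Enum constant for a byte read from the stream.
    static TypeSketch fromByte(byte code) {
        TypeSketch t = BY_CODE.get(code);
        if (t == null) {
            throw new IllegalArgumentException("Unknown type code: " + code);
        }
        return t;
    }
}
```

Keeping an explicit code per constant, rather than relying on ordinal(), means constants can be reordered or added without changing what is already on disk.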




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-06 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Open  (was: Patch Available)

trying to tickle hudson

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-06 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Patch Available  (was: Open)

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1427) Monitor and kill runaway UDFs

2010-06-06 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876093#action_12876093
 ] 

Dmitriy V. Ryaboy commented on PIG-1427:


Ashutosh, Alan, et al: review please.

 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch, 
 PIG-1427.diff


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1333) API interface to Pig

2010-06-06 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12876111#action_12876111
 ] 

Dmitriy V. Ryaboy commented on PIG-1333:


That's a heck of a patch. I am really looking forward to having this available.

Not sure you need the map in the PIG_FEATURE enum.  You can get an enum by 
offset using PIG_FEATURE.values(), so no need for the constructor; you can get 
a string representation using pigFeature.name() or pigFeature.toString(), so no 
need for getString(); and you can get the ordinal using pigFeature.ordinal(). 
Granted, the ordinals are 0-based, but you can just throw in a dummy value for 
the 0th spot to preserve all the offsets as they are.
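
To spell that out, a small Java sketch follows. FeatureSketch and its constant names are made up for illustration and are not the contents of the PIG_FEATURE enum in the patch. The built-in values(), name() and ordinal() methods cover offset lookup, string form and numeric index without any hand-written map or constructor.

```java
// Hypothetical feature enum; constant names are illustrative only.
enum FeatureSketch {
    PLACEHOLDER,      // dummy 0th value so real features start at offset 1
    MERGE_JOIN,
    REPLICATED_JOIN,
    SKEWED_JOIN;

    // Look up a constant by its offset using the built-in values() array.
    static FeatureSketch fromOffset(int offset) {
        return values()[offset];
    }
}
```

FeatureSketch.fromOffset(1) returns MERGE_JOIN, whose name() is the string MERGE_JOIN and whose ordinal() is 1. Note that values() returns a fresh array copy on every call, so a hot path may want to cache it.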

I see that you explicitly pull out the known and enumerated Pig counters. Any 
reason not to make all other job counters available as well via the same 
interface?



 API interface to Pig
 

 Key: PIG-1333
 URL: https://issues.apache.org/jira/browse/PIG-1333
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1333.patch


 It would be nice to make Pig more friendly for applications like workflow 
 that would be executing pig scripts on the user's behalf.
 Currently, they would have to use pig command line to execute the code; 
 however, this limits the kind of output that would be delivered. 
 For instance, it is hard to produce error information that is easy to use 
 programmatically or collect statistics.
 The proposal is to create a class that mimics the behavior of the Main but 
 gives users a status object back. The main code of pig would look 
 something like:
 public static void main(String args[])
 {
     PigStatus ps = PigMain.exec(args);
     System.exit(ps.rc);
 }
 We need to define the following:
 - Content of PigStatus. It should at least include
* return code
* error string
* exception 
* statistics
 - A way to propagate the status class through pig code

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-02 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12874865#action_12874865
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


I notice that the issue has been discussed before in PIG-889, and Santosh 
argued (convincingly) that adding this method to PigLogger might not make 
sense. Santosh, would you like to suggest a different place to put this 
functionality? I am not married to using this method, it's just the path of 
least resistance.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress 
 etc. from UDFs. 




[jira] Commented: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-01 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12873908#action_12873908
 ] 

Dmitriy V. Ryaboy commented on PIG-1428:


Findbugs is quite right to call me out on the synchronization thing. I am not 
sure why the setter needs to be synchronized; I am even less sure the getter 
should be.  Seems like this would add one more lock every time we want to 
increment a counter or write a log line, which is unfortunate (I assume those 
objects handle their own concurrency issues). Can Richard or Pradeep comment on 
that?
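
To make the trade-off concrete, here is a minimal sketch of one lock-free alternative: publishing the reference through a volatile field so that neither accessor takes a monitor. The class names ReporterHolder and CountingReporter are hypothetical stand-ins, not Pig classes, and this is not necessarily what the attached patch does. It also relies on the assumption stated above, that the returned object handles its own concurrency.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical reporter whose own methods are already thread-safe.
final class CountingReporter {
    private final AtomicLong count = new AtomicLong();

    void increment() { count.incrementAndGet(); }

    long value() { return count.get(); }
}

// Hypothetical holder. The volatile field guarantees that a reference
// written by one thread is visible to later reads in other threads,
// so the getter and setter do not need the synchronized keyword.
final class ReporterHolder {
    private volatile CountingReporter reporter;

    void setReporter(CountingReporter reporter) {
        this.reporter = reporter;
    }

    CountingReporter getReporter() {
        return reporter;
    }
}
```

A volatile read is generally cheaper than acquiring a lock, but it only protects the reference itself; any compound check-then-set logic in the setter would still need a lock or an atomic reference.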

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-01 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Open  (was: Patch Available)

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-01 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Patch Available  (was: Open)

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-06-01 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Attachment: PIG-1428.patch

Removed the synchronized keyword.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch, PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Assigned: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-05-31 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy reassigned PIG-1428:
--

Assignee: Dmitriy V. Ryaboy

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
Assignee: Dmitriy V. Ryaboy
 Fix For: 0.8.0

 Attachments: PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-05-31 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Attachment: PIG-1428.patch

No tests, as this is trivial.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1428) Add getPigStatusReporter() to PigHadoopLogger

2010-05-31 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1428:
---

Status: Patch Available  (was: Open)

Please review if this gets no -1s other than for lack of tests.

 Add getPigStatusReporter() to PigHadoopLogger
 -

 Key: PIG-1428
 URL: https://issues.apache.org/jira/browse/PIG-1428
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.7.0
Reporter: Ashutosh Chauhan
 Fix For: 0.8.0

 Attachments: PIG-1428.patch


 Without this getter method, it's not possible to get counters, report progress, 
 etc. from UDFs. 




[jira] Updated: (PIG-1427) Monitor and kill runaway UDFs

2010-05-31 Thread Dmitriy V. Ryaboy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy V. Ryaboy updated PIG-1427:
---

Attachment: guava-r03.jar

Attaching the Guava jar that needs to be placed in lib/ in order to test this.
It is theoretically available via Maven, but at the moment the Maven deployment 
is misconfigured and the artifact cannot be fetched (see above reference).

The Guava library is licensed under Apache 2.0: 
http://code.google.com/p/guava-libraries/



 Monitor and kill runaway UDFs
 -

 Key: PIG-1427
 URL: https://issues.apache.org/jira/browse/PIG-1427
 Project: Pig
  Issue Type: New Feature
Affects Versions: 0.8.0
Reporter: Dmitriy V. Ryaboy
Assignee: Dmitriy V. Ryaboy
 Attachments: guava-r03.jar, monitoredUdf.patch, monitoredUdf.patch


 As a safety measure, it is sometimes useful to monitor UDFs as they execute. 
 It is often preferable to return null or some other default value instead of 
 timing out a runaway evaluation and killing a job. We have in the past seen 
 complex regular expressions lead to job failures due to just half a dozen 
 (out of millions) particularly obnoxious strings.
 It would be great to give Pig users a lightweight way of enabling UDF 
 monitoring.
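
The behaviour described above (return a default instead of letting one slow evaluation kill the job) can be sketched with plain java.util.concurrent. This is an illustration only: TimeLimitedCall and callWithTimeout are hypothetical names, the sketch does not use the Pig UDF API, and it is not the mechanism from the attached monitoredUdf.patch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: run a task with a time limit and fall back to a default.
final class TimeLimitedCall {
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis, T defaultValue) {
        // Daemon thread so a task that ignores interruption cannot block JVM exit.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true);
            return thread;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);      // ask the runaway evaluation to stop
            return defaultValue;
        } catch (ExecutionException e) {
            return defaultValue;      // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return defaultValue;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

Creating an executor per call keeps the sketch short but is expensive; a real implementation would more likely reuse a thread. Note also that cancellation relies on interruption, which a tight CPU-bound loop such as a regular-expression match may not observe.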



