[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705037#action_12705037 ] Alan Gates commented on PIG-795:

Eric,

Thanks for the patch. I agree this is a feature that people will find useful. I have a few questions and comments:

1) Is 1% the minimum sample size people will want to work with? Given that data in the grid can be on the order of terabytes, I can see people wanting a 0.1% sample, or even a 0.01% sample. Maybe that's too hard to specify nicely in the syntax, or maybe people will be happy with a 1% minimum. I'm not sure, but it's worth thinking about.

2) Sample and limit aren't really related, so implementing this in limit seems artificial. Could it instead be implemented as a filter with a random function? So the grammar production would look like:

X = SAMPLE Y a% => X = FILTER Y BY a > RANDOM();

with RANDOM being a function you added to return a random number. The advantage of this is that we hope in the future to push filter operators down into the load functions themselves. Intelligent load functions could then take this filter and not even deserialize a record until it decided whether it was going to be kept or not.

3) The patch should include unit tests.

Command that selects a random sample of the rows, similar to LIMIT
--
Key: PIG-795
URL: https://issues.apache.org/jira/browse/PIG-795
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Eric Gaudet
Priority: Trivial
Attachments: sample2.diff

When working with very large data sets (imagine that!), running a pig script can take time. It may be useful to run on a small subset of the data in some situations (e.g. debugging / testing, or to get fast results even if less accurate). The command LIMIT N selects the first N rows of the data, but these are not necessarily randomized. A command SAMPLE X would retain each row only with probability x%.
Note: it is possible to implement this feature with FILTER BY and a UDF, but so is LIMIT, and LIMIT is built-in.
--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
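[Editor's note] The filter-with-a-random-function idea above amounts to Bernoulli sampling: each row is kept independently with probability p, which is exactly what FILTER ... BY RANDOM() < p would do. A minimal sketch of that semantics in plain Java follows; the class and method names are illustrative, not Pig's API, and a fixed seed is used only to make the sketch repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SampleSketch {

    // Keep each row independently with probability p (Bernoulli sampling).
    public static <T> List<T> sample(List<T> rows, double p, long seed) {
        Random rng = new Random(seed);
        List<T> kept = new ArrayList<T>();
        for (T row : rows) {
            if (rng.nextDouble() < p) {  // the FILTER ... BY RANDOM() < p test
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<Integer>();
        for (int i = 0; i < 100000; i++) {
            rows.add(i);
        }
        // A 1% sample of 100,000 rows should keep roughly 1,000 of them.
        System.out.println(sample(rows, 0.01, 42L).size());
    }
}
```

Note that a load function could evaluate this predicate before deserializing a record, which is the push-down advantage Alan mentions.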
[jira] Created: (PIG-796) support conversion from numeric types to chararray
support conversion from numeric types to chararray
---
Key: PIG-796
URL: https://issues.apache.org/jira/browse/PIG-796
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705072#action_12705072 ] Eric Gaudet commented on PIG-795:

Thanks for your feedback. (BTW, should these issues be discussed in a different place?) Here are my comments:

1) I agree that the 1% minimum looks arbitrary and annoying, but I decided to keep it like this for several reasons. Most importantly, I didn't want to disturb the syntax of LIMIT, which expects an integer. Secondly, 1% is a reasonable minimum if you want a statistically significant result. And finally, you can work around the limitation by adding a 2nd level of sample (or more): b = SAMPLE a 1; c = SAMPLE b 1; gives you 0.01%. Now that I think about it, it's easy to change the syntax and use a float for SAMPLE. The value would be a probability between 0.0 and 1.0. It's cleaner this way, and I will send a new patch for that.

2) I implemented it in limit because they are both specialized filters in a way, with a similar syntax. This way the code changes are very small. It already exists as a filter without any coding needed: b = FILTER a BY org.apache.pig.piggybank.evaluation.math.RANDOM() < 0.01; The syntax is not very user-friendly, though.

3) Will add unit tests in the new patch with floats.

I will produce a new patch with the float syntax and unit tests in the next few days, unless you tell me you prefer FILTER BY.
[jira] Created: (PIG-797) Limit with ORDER BY producing wrong results
Limit with ORDER BY producing wrong results
---
Key: PIG-797
URL: https://issues.apache.org/jira/browse/PIG-797
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Olga Natkovich

Query:

A = load 'studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, SUM(A.gpa) as rev;
D = order C by rev;
E = limit D 10;
dump E;

Output:

(alice king,31.7)
(alice laertes,26.453)
(alice thompson,25.867)
(alice van buren,23.59)
(bob allen,19.902)
(bob ichabod,29.0)
(bob king,28.454)
(bob miller,10.28)
(bob underhill,28.137)
(bob van buren,25.992)
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705104#action_12705104 ] Olga Natkovich commented on PIG-795:

Can we implement SAMPLE the same way we implement JOIN - as a macro? This way we will achieve the readability without many code changes.
[jira] Commented: (PIG-626) Statistics (records read by each mapper and reducer)
[ https://issues.apache.org/jira/browse/PIG-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705152#action_12705152 ] Alan Gates commented on PIG-626:

Shubham,

I apologize for being so slow to get to this. I see several issues in the patch.

1) The launcher code has changed significantly. I tried to figure out how to paste your code in, but I was afraid I was doing it wrong so I backed off. You'll need to integrate your patch with the changes.

2) All error messages now require an error code (see http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification).

3) In a couple of places when the code can't find the statistics information it says:

{code}
log.info("Jobs not found in the JobClient. Please try using a different Hadoop mode");
{code}

It seems like this ought to be a warning instead of an info. Also, what does it mean to try a different Hadoop mode? It's not clear from this message what action the user should take.

I promise if you make another patch for this I'll look at it quickly so it doesn't go out of date on you again.

Statistics (records read by each mapper and reducer)
Key: PIG-626
URL: https://issues.apache.org/jira/browse/PIG-626
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.2.0
Reporter: Shubham Chopra
Priority: Minor
Attachments: pigStats.patch, pigStats.patch, pigStats.patch, pigStats.patch, TEST-org.apache.pig.test.TestBZip.txt

This uses the counters framework that hadoop has. Initially, I am just interested in finding out the number of records read by each mapper/reducer, particularly for the last job in any script.
Sample code to access the statistics for the last job:

{code}
String reducePlan = stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_REDUCE_PLAN);
if (reducePlan == null) {
    System.out.println("Records written : " + stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_MAP_OUTPUT_RECORDS));
} else {
    System.out.println("Records written : " + stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_REDUCE_OUTPUT_RECORDS));
}
{code}

The patch contains 7 test cases. These include tests for PigStorage and BinStorage along with one for the multiple MR jobs case.
[jira] Assigned: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-697:
Assignee: Santhosh Srinivasan (was: Alan Gates)

Proposed improvements to pig's optimizer
---
Key: PIG-697
URL: https://issues.apache.org/jira/browse/PIG-697
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch

I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            if (matcher.match(rule)) {
                List<List<O>> matches = matcher.getAllMatches();
                for (List<O> match : matches) {
                    // It matches the pattern. Now check if the transformer
                    // approves as well.
                    if (rule.transformer.check(match)) {
                        // The transformer approves.
                        sawMatch = true;
                        rule.transformer.transform(match);
                    }
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan, without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan. Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException {
    ...
}

/**
 * Push one operator in front of another. This function is for use when
 * the first operator has multiple inputs. The caller can specify
 * which input of the first operator the second operator should be pushed to.
 * @param first operator, assumed to have multiple inputs.
 * @param second operator, will be pushed in front of first
 * @param inputNum, indicates which input of the first operator the second
 * operator
{code}
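[Editor's note] The iterate-until-fixpoint idea in item 2 can be illustrated with a toy rule engine: a single rule that swaps a Filter in front of the Join it follows, applied repeatedly until no rule fires or the iteration cap is hit. Operator names are plain strings here for illustration; the real optimizer works on plan operators, not strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FixpointSketch {

    public static List<String> optimize(List<String> plan) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (int i = 0; i + 1 < plan.size(); i++) {
                // Rule: a Join followed by a Filter becomes Filter then Join.
                if (plan.get(i).equals("Join") && plan.get(i + 1).equals("Filter")) {
                    Collections.swap(plan, i, i + 1);
                    sawMatch = true;
                }
            }
            // Cap iterations so a badly written rule can't loop forever.
        } while (sawMatch && numIterations++ < 1000);
        return plan;
    }

    public static void main(String[] args) {
        List<String> plan =
            new ArrayList<String>(Arrays.asList("Load", "Join", "Filter", "Foreach"));
        // The Filter migrates in front of the Join over repeated passes.
        System.out.println(optimize(plan));
    }
}
```

Each pass only makes local pairwise swaps, yet iterating to a fixpoint lets the simple rule produce a globally reordered plan, which is the point of the proposed change.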
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705134#action_12705134 ] Alan Gates commented on PIG-795:

I think it's fine to have sample as a keyword. It's valuable not just because it's easier syntax, but because in the future it could be expanded to more sophisticated sampling techniques beyond just taking a percentage of the data. For example:

B = SAMPLE A 1 USING 'mywhizbangnewsamplingalgorithm';

What I meant was that your patch could translate SAMPLE underneath into a filter. Then, instead of making changes in the limit code, all you need to do is move RANDOM from piggybank into pig's builtins, and change QueryParser.jjt to do the translation from SAMPLE to FILTER.
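[Editor's note] The parser-level translation Alan describes treats SAMPLE as sugar that macro-expands into a FILTER over RANDOM(). A rough sketch of that desugaring as a string rewrite follows; the regex and class name are illustrative only, since the real change would live in QueryParser.jjt and operate on the parse tree, not on raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SampleDesugarSketch {

    // Matches statements of the form "X = SAMPLE Y 0.01;" (illustrative grammar).
    private static final Pattern SAMPLE =
        Pattern.compile("(\\w+)\\s*=\\s*SAMPLE\\s+(\\w+)\\s+([0-9.]+)\\s*;");

    public static String desugar(String stmt) {
        Matcher m = SAMPLE.matcher(stmt);
        if (!m.matches()) {
            return stmt;  // not a SAMPLE statement, leave it untouched
        }
        // Rewrite into the equivalent FILTER over the built-in RANDOM().
        return m.group(1) + " = FILTER " + m.group(2)
            + " BY RANDOM() < " + m.group(3) + ";";
    }

    public static void main(String[] args) {
        System.out.println(desugar("B = SAMPLE A 0.01;"));
    }
}
```

The benefit of expanding at parse time is that every later phase of the planner sees an ordinary FILTER, so no new physical operator is needed.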
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705169#action_12705169 ] Olga Natkovich commented on PIG-793:

This would require compiling code on the fly, right? Up till now we were trying to avoid that.

Improving memory efficiency of Tuple implementation
---
Key: PIG-793
URL: https://issues.apache.org/jira/browse/PIG-793
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich

Currently, our tuple is a real pig and uses a lot of extra memory. There are several places where we can improve memory efficiency:
(1) Laying out memory for the fields rather than using java objects, since each object for a numeric field takes 16 bytes.
(2) For the cases where we know the schema, using Java arrays rather than ArrayList.
There might be more.
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188 ] Hong Tang commented on PIG-793:

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help if we are moving tuples from one container to another container.

{code}
class LazyTuple implements Tuple {
    ArrayList<Object> fields;   // null if not deserialized
    DataByteArray lazyBytes;    // e.g. serialized bytes of tuple in avro format.
}
{code}

# Improving DataByteArray. It may be changed to an interface (need get(), offset(), and length()), and use a DataByteArrayFactory to create instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset to offset+length (exclusive). There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics where the offset is always 0, and the length is always the length of the buffer.
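[Editor's note] The lazy-deserialization idea above can be sketched in a few lines: hold the serialized bytes and only materialize fields on the first get(). The tab-separated encoding used here is a stand-in for whatever serialization format the tuple actually uses, and the class is illustrative, not Pig's Tuple API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyTupleSketch {

    private final byte[] lazyBytes;   // serialized form of the tuple
    private List<String> fields;      // null until the first field access

    public LazyTupleSketch(byte[] raw) {
        this.lazyBytes = raw;
    }

    public String get(int i) {
        if (fields == null) {
            // Deserialize only when a field is actually requested, so
            // tuples that are merely moved between containers pay nothing.
            String s = new String(lazyBytes, StandardCharsets.UTF_8);
            fields = new ArrayList<String>(Arrays.asList(s.split("\t")));
        }
        return fields.get(i);
    }

    public static void main(String[] args) {
        LazyTupleSketch t = new LazyTupleSketch(
            "Amy\twww.cnn.com\t8am".getBytes(StandardCharsets.UTF_8));
        System.out.println(t.get(0));  // deserialization happens here, not at construction
    }
}
```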
[jira] Issue Comment Edited: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188 ] Hong Tang edited comment on PIG-793 at 5/1/09 4:59 PM:

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help if we are moving tuples from one container to another container.

{code}
class LazyTuple implements Tuple {
    ArrayList<Object> fields;   // null if not deserialized
    DataByteArray lazyBytes;    // e.g. serialized bytes of tuple in avro format.
}
{code}

# Improving DataByteArray. It may be changed to an interface (need get(), offset(), and length()), and use a DataByteArrayFactory to create instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset to offset+length (exclusive). There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics where the offset is always 0, and the length is always the length of the buffer.
[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Gaudet updated PIG-795:
Attachment: sample3.diff

This is the implementation of the SAMPLE operator rewritten as FILTER by the query parser, as suggested by Olga and Alan. It uses a new built-in function RANDOM(), copied from piggybank. This patch also adds the unit test TestSample. I am unfamiliar with LogicalPlan crafting, so the code might not be the best. Please feel free to clean it up.
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798:

Description: In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():

{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = group A by name;
store B into '/user/viraj/binstoragecreateop' using BinStorage();
dump B;
{code}

I later load the file 'binstoragecreateop' in the following way:

{code}
A = load '/user/viraj/binstoragecreateop' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Result:
(Amy)
(Fred)

The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}

So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?
[jira] Created: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??
Schema errors when using PigStorage and none when using BinStorage??
---
Key: PIG-798
URL: https://issues.apache.org/jira/browse/PIG-798
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Fix For: 0.2.0

In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():

{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = group A by name;
store B into '/user/viraj/binstoragecreateop' using BinStorage();
dump B;
{code}

I later load the file 'binstoragecreateop' in the following way:

{code}
A = load '/user/viraj/binstoragecreateop' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Result:
(Amy)
(Fred)

The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}

So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798:

Summary: Schema errors when using PigStorage and none when using BinStorage in FOREACH?? (was: Schema errors when using PigStorage and none when using BinStorage??)
Attachments: binstoragecreateop, schemaerr.pig, visits.txt