[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705037#action_12705037 ] Alan Gates commented on PIG-795:

Eric,

Thanks for the patch. I agree this is a feature that people will find useful. I have a few questions and comments:

1) Is 1% the minimum sample size people will want to work with? Given that data in the grid can be on the order of terabytes, I can see people wanting a 0.1% sample, or even a 0.01% sample. Maybe that's too hard to specify nicely in the syntax, or maybe people will be happy with a 1% minimum. I'm not sure, but it's worth thinking about.

2) Sample and limit aren't really related, so implementing this in limit seems artificial. Could it instead be implemented as a filter with a random function? So the grammar production would look like:

X = SAMPLE Y a% => X = FILTER Y BY a > RANDOM();

with RANDOM being a function you added to return a random number. The advantage of this is that we hope in the future to push filter operators down into the load functions themselves. Intelligent load functions could then take this filter and not even deserialize a record until it decided whether it was going to be kept or not.

3) The patch should include unit tests.

Command that selects a random sample of the rows, similar to LIMIT
--
Key: PIG-795
URL: https://issues.apache.org/jira/browse/PIG-795
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Eric Gaudet
Priority: Trivial
Attachments: sample2.diff

When working with very large data sets (imagine that!), running a pig script can take time. It may be useful to run on a small subset of the data in some situations (e.g. debugging / testing, or to get fast results even if less accurate). The command LIMIT N selects the first N rows of the data, but these are not necessarily randomized. A command SAMPLE X would retain each row only with probability x%.
Note: it is possible to implement this feature with FILTER BY and a UDF, but so is LIMIT, and LIMIT is built-in.
--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
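[Editor's note] The filter-with-a-random-function idea above amounts to Bernoulli sampling: each row is kept independently with probability p, which is exactly what FILTER ... BY RANDOM() < p would do. A minimal sketch of that semantics in plain Java follows; the class and method names are illustrative, not Pig's API, and a fixed seed is used only to make the sketch repeatable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class SampleSketch {

    // Keep each row independently with probability p (Bernoulli sampling).
    public static <T> List<T> sample(List<T> rows, double p, long seed) {
        Random rng = new Random(seed);
        List<T> kept = new ArrayList<T>();
        for (T row : rows) {
            if (rng.nextDouble() < p) {  // the FILTER ... BY RANDOM() < p test
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<Integer>();
        for (int i = 0; i < 100000; i++) {
            rows.add(i);
        }
        // A 1% sample of 100,000 rows should keep roughly 1,000 of them.
        System.out.println(sample(rows, 0.01, 42L).size());
    }
}
```

Note that a load function could evaluate this predicate before deserializing a record, which is the push-down advantage Alan mentions.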
[jira] Created: (PIG-796) support conversion from numeric types to chararray
support conversion from numeric types to chararray
---
Key: PIG-796
URL: https://issues.apache.org/jira/browse/PIG-796
Project: Pig
Issue Type: Improvement
Affects Versions: 0.2.0
Reporter: Olga Natkovich
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705072#action_12705072 ] Eric Gaudet commented on PIG-795:

Thanks for your feedback. (BTW, should these issues be discussed in a different place?) Here are my comments:

1) I agree that the 1% minimum looks arbitrary and annoying, but I decided to keep it like this for several reasons. Most importantly, I didn't want to disturb the syntax of LIMIT, which expects an integer. Secondly, 1% is a reasonable minimum if you want a statistically significant result. And finally, you can work around the limitation by adding a 2nd level of sample (or more): b = SAMPLE a 1; c = SAMPLE b 1; gives you 0.01%. Now that I think about it, it's easy to change the syntax and use a float for SAMPLE. The value would be a probability between 0.0 and 1.0. It's cleaner this way, and I will send a new patch for that.

2) I implemented it in limit because they are both specialized filters in a way, with a similar syntax. This way the code changes are very small. It already exists as a filter without any coding needed: b = FILTER a BY org.apache.pig.piggybank.evaluation.math.RANDOM() < 0.01; The syntax is not very user-friendly, though.

3) Will add unit tests in the new patch with floats.

I will produce a new patch with the float syntax and unit tests in the next few days, unless you tell me you prefer FILTER BY.
[jira] Created: (PIG-797) Limit with ORDER BY producing wrong results
Limit with ORDER BY producing wrong results
---
Key: PIG-797
URL: https://issues.apache.org/jira/browse/PIG-797
Project: Pig
Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Olga Natkovich

Query:

A = load 'studenttab10k' as (name, age, gpa);
B = group A by name;
C = foreach B generate group, SUM(A.gpa) as rev;
D = order C by rev;
E = limit D 10;
dump E;

Output:

(alice king,31.7)
(alice laertes,26.453)
(alice thompson,25.867)
(alice van buren,23.59)
(bob allen,19.902)
(bob ichabod,29.0)
(bob king,28.454)
(bob miller,10.28)
(bob underhill,28.137)
(bob van buren,25.992)
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705104#action_12705104 ] Olga Natkovich commented on PIG-795:

Can we implement SAMPLE the same way we implement JOIN - as a macro? This way we will achieve the readability without many code changes.
[jira] Commented: (PIG-626) Statistics (records read by each mapper and reducer)
[ https://issues.apache.org/jira/browse/PIG-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705152#action_12705152 ] Alan Gates commented on PIG-626:

Shubham,

I apologize for being so slow to get to this. I see several issues in the patch.

1) The launcher code has changed significantly. I tried to figure out how to paste your code in, but I was afraid I was doing it wrong so I backed off. You'll need to integrate your patch with the changes.

2) All error messages now require an error code (see http://wiki.apache.org/pig/PigErrorHandlingFunctionalSpecification).

3) In a couple of places when the code can't find the statistics information it says:

{code}
log.info("Jobs not found in the JobClient. Please try using a different Hadoop mode");
{code}

It seems like this ought to be a warning instead of an info. Also, what does it mean to try a different Hadoop mode? It's not clear from this message what action the user should take.

I promise if you make another patch for this I'll look at it quickly so it doesn't go out of date on you again.

Statistics (records read by each mapper and reducer)
Key: PIG-626
URL: https://issues.apache.org/jira/browse/PIG-626
Project: Pig
Issue Type: New Feature
Components: impl
Affects Versions: 0.2.0
Reporter: Shubham Chopra
Priority: Minor
Attachments: pigStats.patch, pigStats.patch, pigStats.patch, pigStats.patch, TEST-org.apache.pig.test.TestBZip.txt

This uses the counters framework that hadoop has. Initially, I am just interested in finding out the number of records read by each mapper/reducer, particularly for the last job in any script.
Sample code to access the statistics for the last job:

{code}
String reducePlan = stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_REDUCE_PLAN);
if (reducePlan == null) {
    System.out.println("Records written : " + stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_MAP_OUTPUT_RECORDS));
} else {
    System.out.println("Records written : " + stats.getPigStats().get(stats.getLastJobID()).get(PIG_STATS_REDUCE_OUTPUT_RECORDS));
}
{code}

The patch contains 7 test cases. These include tests for PigStorage and BinStorage along with one for the multiple MR jobs case.
[jira] Assigned: (PIG-697) Proposed improvements to pig's optimizer
[ https://issues.apache.org/jira/browse/PIG-697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alan Gates reassigned PIG-697:
Assignee: Santhosh Srinivasan (was: Alan Gates)

Proposed improvements to pig's optimizer
---
Key: PIG-697
URL: https://issues.apache.org/jira/browse/PIG-697
Project: Pig
Issue Type: Bug
Components: impl
Reporter: Alan Gates
Assignee: Santhosh Srinivasan
Attachments: OptimizerPhase1.patch, OptimizerPhase1_part2.patch

I propose the following changes to pig optimizer, plan, and operator functionality to support more robust optimization:

1) Remove the required array from Rule. This will change rules so that they only match exact patterns instead of allowing missing elements in the pattern. This has the downside that if a given rule applies to two patterns (say Load-Filter-Group and Load-Group) you have to write two rules. But it has the upside that the resulting rules know exactly what they are getting. The original intent of this was to reduce the number of rules that needed to be written. But the resulting rules have to do a lot of work to understand the operators they are working with. With exact matches only, each rule will know exactly the operators it is working on and can apply the logic of shifting the operators around. All four of the existing rules set all entries of required to true, so removing this will have no effect on them.

2) Change PlanOptimizer.optimize to iterate over the rules until there are no conversions or a certain number of iterations has been reached. Currently the function is:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    for (Rule rule : mRules) {
        if (matcher.match(rule)) {
            // It matches the pattern. Now check if the transformer
            // approves as well.
            List<List<O>> matches = matcher.getAllMatches();
            for (List<O> match : matches) {
                if (rule.transformer.check(match)) {
                    // The transformer approves.
                    rule.transformer.transform(match);
                }
            }
        }
    }
}
{code}

It would change to be:

{code}
public final void optimize() throws OptimizerException {
    RuleMatcher matcher = new RuleMatcher();
    boolean sawMatch;
    int numIterations = 0;
    do {
        sawMatch = false;
        for (Rule rule : mRules) {
            if (matcher.match(rule)) {
                List<List<O>> matches = matcher.getAllMatches();
                for (List<O> match : matches) {
                    // It matches the pattern. Now check if the transformer
                    // approves as well.
                    if (rule.transformer.check(match)) {
                        // The transformer approves.
                        sawMatch = true;
                        rule.transformer.transform(match);
                    }
                }
            }
        }
        // Not sure if 1000 is the right number of iterations, maybe it
        // should be configurable so that large scripts don't stop too
        // early.
    } while (sawMatch && numIterations++ < 1000);
}
{code}

The reason for limiting the number of iterations is to avoid infinite loops. The reason for iterating over the rules is so that each rule can be applied multiple times as necessary. This allows us to write simple rules, mostly swaps between neighboring operators, without worrying that we get the plan right in one pass. For example, we might have a plan that looks like Load-Join-Filter-Foreach, and we want to optimize it to Load-Foreach-Filter-Join. With two simple rules (swap filter and join, and swap foreach and filter), applied iteratively, we can get from the initial to the final plan, without needing to understand the big picture of the entire plan.

3) Add three calls to OperatorPlan:

{code}
/**
 * Swap two operators in a plan. Both of the operators must have single
 * inputs and single outputs.
 * @param first operator
 * @param second operator
 * @throws PlanException if either operator is not single input and output.
 */
public void swap(E first, E second) throws PlanException {
    ...
}

/**
 * Push one operator in front of another. This function is for use when
 * the first operator has multiple inputs. The caller can specify
 * which input of the first operator the second operator should be pushed to.
 * @param first operator, assumed to have multiple inputs.
 * @param second operator, will be pushed in front of first
 * @param inputNum, indicates which input of the first operator the second
 * operator
{code}
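[Editor's note] The iterate-until-fixpoint idea in item 2 can be illustrated with a toy rule engine: a single rule that swaps a Filter in front of the Join it follows, applied repeatedly until no rule fires or the iteration cap is hit. Operator names are plain strings here for illustration; the real optimizer works on plan operators, not strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class FixpointSketch {

    public static List<String> optimize(List<String> plan) {
        boolean sawMatch;
        int numIterations = 0;
        do {
            sawMatch = false;
            for (int i = 0; i + 1 < plan.size(); i++) {
                // Rule: a Join followed by a Filter becomes Filter then Join.
                if (plan.get(i).equals("Join") && plan.get(i + 1).equals("Filter")) {
                    Collections.swap(plan, i, i + 1);
                    sawMatch = true;
                }
            }
            // Cap iterations so a badly written rule can't loop forever.
        } while (sawMatch && numIterations++ < 1000);
        return plan;
    }

    public static void main(String[] args) {
        List<String> plan =
            new ArrayList<String>(Arrays.asList("Load", "Join", "Filter", "Foreach"));
        // The Filter migrates in front of the Join over repeated passes.
        System.out.println(optimize(plan));
    }
}
```

Each pass only makes local pairwise swaps, yet iterating to a fixpoint lets the simple rule produce a globally reordered plan, which is the point of the proposed change.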
[jira] Commented: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705134#action_12705134 ] Alan Gates commented on PIG-795:

I think it's fine to have sample as a keyword. It's valuable not just because it's easier syntax, but because in the future it could be expanded to more sophisticated sampling techniques beyond just taking a percentage of the data. For example:

B = SAMPLE A 1 USING 'mywhizbangnewsamplingalgorithm';

What I meant was that your patch could translate SAMPLE underneath into a filter. Then, instead of making changes in the limit code, all you need to do is move RANDOM from piggybank into pig's builtins, and change QueryParser.jjt to do the translation from SAMPLE to FILTER.
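[Editor's note] The parser-level translation Alan describes treats SAMPLE as sugar that macro-expands into a FILTER over RANDOM(). A rough sketch of that desugaring as a string rewrite follows; the regex and class name are illustrative only, since the real change would live in QueryParser.jjt and operate on the parse tree, not on raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SampleDesugarSketch {

    // Matches statements of the form "X = SAMPLE Y 0.01;" (illustrative grammar).
    private static final Pattern SAMPLE =
        Pattern.compile("(\\w+)\\s*=\\s*SAMPLE\\s+(\\w+)\\s+([0-9.]+)\\s*;");

    public static String desugar(String stmt) {
        Matcher m = SAMPLE.matcher(stmt);
        if (!m.matches()) {
            return stmt;  // not a SAMPLE statement, leave it untouched
        }
        // Rewrite into the equivalent FILTER over the built-in RANDOM().
        return m.group(1) + " = FILTER " + m.group(2)
            + " BY RANDOM() < " + m.group(3) + ";";
    }

    public static void main(String[] args) {
        System.out.println(desugar("B = SAMPLE A 0.01;"));
    }
}
```

The benefit of expanding at parse time is that every later phase of the planner sees an ordinary FILTER, so no new physical operator is needed.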
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705169#action_12705169 ] Olga Natkovich commented on PIG-793:

This would require compiling code on the fly, right? Up till now we were trying to avoid that.

Improving memory efficiency of Tuple implementation
---
Key: PIG-793
URL: https://issues.apache.org/jira/browse/PIG-793
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich

Currently, our tuple is a real pig and uses a lot of extra memory. There are several places where we can improve memory efficiency:
(1) Laying out memory for the fields rather than using java objects, since each object for a numeric field takes 16 bytes.
(2) For the cases where we know the schema, using Java arrays rather than ArrayList.
There might be more.
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188 ] Hong Tang commented on PIG-793:

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help if we are moving tuples from one container to another container.

{code}
class LazyTuple implements Tuple {
    ArrayList<Object> fields;   // null if not deserialized
    DataByteArray lazyBytes;    // e.g. serialized bytes of tuple in avro format.
}
{code}

# Improving DataByteArray. It may be changed to an interface (need get(), offset(), and length()), and use a DataByteArrayFactory to create instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset to offset+length (exclusive). There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics where the offset is always 0, and the length is always the length of the buffer.
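[Editor's note] The lazy-deserialization idea above can be sketched in a few lines: hold the serialized bytes and only materialize fields on the first get(). The tab-separated encoding used here is a stand-in for whatever serialization format the tuple actually uses, and the class is illustrative, not Pig's Tuple API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LazyTupleSketch {

    private final byte[] lazyBytes;   // serialized form of the tuple
    private List<String> fields;      // null until the first field access

    public LazyTupleSketch(byte[] raw) {
        this.lazyBytes = raw;
    }

    public String get(int i) {
        if (fields == null) {
            // Deserialize only when a field is actually requested, so
            // tuples that are merely moved between containers pay nothing.
            String s = new String(lazyBytes, StandardCharsets.UTF_8);
            fields = new ArrayList<String>(Arrays.asList(s.split("\t")));
        }
        return fields.get(i);
    }

    public static void main(String[] args) {
        LazyTupleSketch t = new LazyTupleSketch(
            "Amy\twww.cnn.com\t8am".getBytes(StandardCharsets.UTF_8));
        System.out.println(t.get(0));  // deserialization happens here, not at construction
    }
}
```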
[jira] Issue Comment Edited: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188 ] Hong Tang edited comment on PIG-793 at 5/1/09 4:59 PM:

Two ideas:

# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help if we are moving tuples from one container to another container.

{code}
class LazyTuple implements Tuple {
    ArrayList<Object> fields;   // null if not deserialized
    DataByteArray lazyBytes;    // e.g. serialized bytes of tuple in avro format.
}
{code}

# Improving DataByteArray. It may be changed to an interface (need get(), offset(), and length()), and use a DataByteArrayFactory to create instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(byte[], offset, length), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset to offset+length (exclusive). There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics where the offset is always 0, and the length is always the length of the buffer.
[jira] Updated: (PIG-795) Command that selects a random sample of the rows, similar to LIMIT
[ https://issues.apache.org/jira/browse/PIG-795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Gaudet updated PIG-795:
Attachment: sample3.diff

This is the implementation of the SAMPLE operator rewritten as FILTER by the query parser, as suggested by Olga and Alan. It uses a new built-in function RANDOM(), copied from piggybank. This patch also adds the unit test TestSample. I am unfamiliar with LogicalPlan crafting, so the code might not be the best. Please feel free to clean it up.
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798:

Description: In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():

{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = group A by name;
store B into '/user/viraj/binstoragecreateop' using BinStorage();
dump B;
{code}

I later load the file 'binstoragecreateop' in the following way:

{code}
A = load '/user/viraj/binstoragecreateop' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Result:
(Amy)
(Fred)

The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}

So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?
[jira] Created: (PIG-798) Schema errors when using PigStorage and none when using BinStorage??
Schema errors when using PigStorage and none when using BinStorage??
---
Key: PIG-798
URL: https://issues.apache.org/jira/browse/PIG-798
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.2.0
Reporter: Viraj Bhat
Fix For: 0.2.0

In the following script I have a tab-separated text file, which I load using PigStorage() and store using BinStorage():

{code}
A = load '/user/viraj/visits.txt' using PigStorage() as (name:chararray, url:chararray, time:chararray);
B = group A by name;
store B into '/user/viraj/binstoragecreateop' using BinStorage();
dump B;
{code}

I later load the file 'binstoragecreateop' in the following way:

{code}
A = load '/user/viraj/binstoragecreateop' using BinStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

Result:
(Amy)
(Fred)

The above code works properly and returns the right results. If I use PigStorage() to achieve the same, I get the following error:

{code}
A = load '/user/viraj/visits.txt' using PigStorage();
B = foreach A generate $0 as name:chararray;
dump B;
{code}

{code}
2009-05-02 03:58:50,662 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1022: Type mismatch merging schema prefix. Field Schema: bytearray. Other Field Schema: name: chararray
Details at logfile: /home/viraj/pig-svn/trunk/pig_1241236728311.log
{code}

So why should the semantics of BinStorage() be different from PigStorage(), where it is ok not to specify a schema? Should it not be consistent across both?
[jira] Updated: (PIG-798) Schema errors when using PigStorage and none when using BinStorage in FOREACH??
[ https://issues.apache.org/jira/browse/PIG-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Bhat updated PIG-798:

Summary: Schema errors when using PigStorage and none when using BinStorage in FOREACH?? (was: Schema errors when using PigStorage and none when using BinStorage??)
Attachments: binstoragecreateop, schemaerr.pig, visits.txt