Re: Any reason a bunch of nearly-identical jobs would suddenly stop working?

2011-03-09 Thread Mridul Muralidharan
Did you try checking the task logs ? There might be more details there ... Regards, Mridul On Wednesday 09 March 2011 04:23 AM, Kris Coward wrote: So I queued up a batch of jobs last night to run overnight (and into the day a bit, owing to a bottleneck on the scheduler the way that things

Re: Schema

2011-03-09 Thread Mridul Muralidharan
In which case, can't you model that as a Bag ? I imagine something like Tuple with fields person:chararray, books_read:bag{ (name:chararray, isbn:chararray) }, etc ? Of course, it will work as a bag if the tuple contained within it has a fixed schema :-) (unless you repeat this process N nu
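The bag-of-tuples schema suggested above might look like this in Pig Latin (file name and field names are illustrative, not from the thread):

```pig
-- Hypothetical sketch of the schema described above: one row per person,
-- with a bag of (name, isbn) tuples for the books they have read.
readers = LOAD 'readers.tsv' USING PigStorage('\t')
    AS (person:chararray,
        books_read:bag{t:(name:chararray, isbn:chararray)});
```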

Re: [VOTE] Pig 1.0!

2011-03-08 Thread Mridul Muralidharan
As I elaborated before, given the state of the pig project, I would vote "-1" on the next release being 1.0. Of course, it is, as mentioned, non-binding :-) Regards, Mridul On Tuesday 08 March 2011 04:51 AM, Olga Natkovich wrote: Hi guys, We had a lively discussion last week regarding what version number

Re: [DISCUSSION] Pig.next

2011-03-04 Thread Mridul Muralidharan
IMO 1.0 for a product typically promises : 1) Reasonable stability of interfaces. Typically only major version changes break interface compatibility. While we are at 0.x, it seems to be considered 'okish' to violate this : but once you are at 1.0 and higher, breaking interface contracts will

Re: XMLLoader

2011-03-01 Thread Mridul Muralidharan
Since XMLLoader does not seem to satisfy your requirements, and assuming each line contains an xml document (which is required by XMLLoader anyway iirc), what you can do is write a simple udf to handle this. Use a line reader as loadfunc, and write a udf which parses the input line as a Docum
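The load-plus-UDF workaround sketched above could look roughly like this (ParseXmlLine is a hypothetical user UDF, not a Pig built-in):

```pig
-- Sketch only: TextLoader hands each line through as a single chararray;
-- a custom UDF (assumed here) then DOM-parses the line as an XML document.
REGISTER myudfs.jar;
raw = LOAD 'docs' USING TextLoader() AS (line:chararray);
parsed = FOREACH raw GENERATE myudfs.ParseXmlLine(line);
```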

Re: Comparison between long

2010-12-16 Thread Mridul Muralidharan
On Thursday 16 December 2010 03:58 AM, John Hui wrote: The outputSchema is set to Long 90 @Override 91 public Schema outputSchema(Schema input) { 92 return new Schema(new Schema.FieldSchema(getSchemaName(this.getClass ().getName().toLowerCase(), input), DataType.CHARARRAY

Re: matches with regular expression in pig

2010-12-03 Thread Mridul Muralidharan
That is a very nice tip, thanks ! Regards, Mridul On Friday 03 December 2010 02:49 PM, Anze wrote: You could also try 'abc[|].*'. I find it is often easier (and less error-prone) to use this principle than it is to escape the escaping character... :) Just be careful with '-', it must be at
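A minimal sketch of the character-class trick mentioned above (relation and field names are made up):

```pig
-- '[|]' matches a literal pipe without needing backslash escaping.
A = LOAD 'input' AS (line:chararray);
B = FILTER A BY line MATCHES 'abc[|].*';
```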

Re: Writing filter function that takes constructor param?

2010-12-02 Thread Mridul Muralidharan
As of now, UDFs are limited to Strings as constructor params. Regards, Mridul On Thursday 02 December 2010 02:18 PM, Sheeba George wrote: Hi Daniel I have a related question. My UDF has a constructor that takes 2 params. public TopUDF(int top, int type){ m_cnt = top; m_type
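A hedged sketch of the usual workaround: pass the numbers as string literals and parse them inside the UDF constructor (TopUDF is the poster's class; the argument values here are assumptions):

```pig
-- UDF constructors only accept strings, so '10' and '1' would be
-- converted with Integer.parseInt inside TopUDF's constructor.
REGISTER myudfs.jar;
DEFINE MY_TOP TopUDF('10', '1');
A = LOAD 'input' AS (score:int);
B = FILTER A BY MY_TOP(score);
```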

Re: Regarding Multifile InputFormat patch

2010-10-29 Thread Mridul Muralidharan
It would be a tradeoff between data-locality versus number of tasks executed. In some of our experiments, it performed much worse (don't have actual numbers, but it was in the 2x ballpark iirc) : of course, ours was a highly constrained and specialized experiment anyway ! On the other hand, th

[jira] Updated: (PIG-1685) Pig is unable to handle counters for glob paths ?

2010-10-26 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan updated PIG-1685: - Description: We get the following exception, which seems to be related to processing

[jira] Commented: (PIG-1685) Pig is unable to handle counters for glob paths ?

2010-10-20 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923156#action_12923156 ] Mridul Muralidharan commented on PIG-1685: -- Thanks guys, that was real q

[jira] Created: (PIG-1685) Pig is unable to handle counters for glob paths ?

2010-10-18 Thread Mridul Muralidharan (JIRA)
Versions: 0.8.0 Reporter: Mridul Muralidharan We get the following exception, which seems to be related to processing counters per path : java.net.URISyntaxException: Illegal character in path at index 71: /projects/gridfaces/mridulm/doopdex/k_data_index/20100830_cdxcore_10.7_

[jira] Commented: (PIG-1684) Inconsistent usage of store func.

2010-10-17 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921908#action_12921908 ] Mridul Muralidharan commented on PIG-1684: -- I am not sure if I understand

[jira] Created: (PIG-1684) Inconsistent usage of store func.

2010-10-16 Thread Mridul Muralidharan (JIRA)
: A custom StoreFuncInterface used to store data at the reducer. (Output of a group ) Reporter: Mridul Muralidharan Pig seems to be using multiple instances of StoreFuncInterface in the reducer inconsistently. Some hadoop api calls are made to one instance and others made to other

Re: Filtering on bag

2010-09-05 Thread Mridul Muralidharan
I did not follow your pig snippet ... it looks wrong (since the only output is 'group'). Could you do an "order by" and then a "limit" ? I can't remember offhand if "order by" works within a nested foreach (don't have pig access right now to test, sorry). If it is supported, something like might b
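If supported in the Pig version at hand, the nested order-then-limit idea above might be sketched as follows (schema assumed; whether ORDER/LIMIT work inside a nested FOREACH is version-dependent):

```pig
-- Keep only the highest-scoring row per id.
A = LOAD 'input' AS (id:chararray, score:int);
G = GROUP A BY id;
top1 = FOREACH G {
    sorted = ORDER A BY score DESC;
    best = LIMIT sorted 1;
    GENERATE group, FLATTEN(best);
};
```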

[jira] Commented: (PIG-1309) Sort Merge Cogroup

2010-09-03 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905859#action_12905859 ] Mridul Muralidharan commented on PIG-1309: -- Condition (1) refers to only expl

Re: [jira] Updated: (PIG-1309) Map-side Cogroup

2010-09-03 Thread Mridul Muralidharan
Condition (1) refers to only explicit (user specified) statements right ? Not implicit project introduced by pig to conform to schema ? Regards, Mridul On Saturday 21 August 2010 12:59 AM, Ashutosh Chauhan (JIRA) wrote: [ https://issues.apache.org/jira/browse/PIG-1309?page=com.atlass

Re: Working on multiple rows

2010-08-29 Thread Mridul Muralidharan
Taking a guess, you could group things based on your criterion and condition. Something simple like : a) group by usergroup (might be too expensive ? number of records across timestamps for users in a group might be large !). b) group by (usergroup, timestamp / window) [this will lose acc
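Option (b) above, bucketing timestamps into fixed windows, might look like this (a 3600-second window and the field names are assumptions):

```pig
-- Grouping on (usergroup, window) keeps each group small, at the cost of
-- losing accuracy at window boundaries, as noted above.
A = LOAD 'events' AS (usergroup:chararray, ts:long, payload:chararray);
B = GROUP A BY (usergroup, ts / 3600);
```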

Re: COUNT(A.field1)

2010-08-29 Thread Mridul Muralidharan
sting for us will be dismissed and not passed to the reducer part of the job, and besides wouldn't the presence of null values affect the performance? For example, if a2 would have many null values, then less values would be passed too right? Renato M. 2010/8/27 Mridul Muralidharan On seco

Re: COUNT(A.field1)

2010-08-27 Thread Mridul Muralidharan
On second thoughts, that part is obvious - duh - Mridul On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote: But it does for COUNT(A.a2) ? That is interesting, and somehow weird :) Thanks ! Mridul On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote: I think if you do

Re: COUNT(A.field1)

2010-08-26 Thread Mridul Muralidharan
But it does for COUNT(A.a2) ? That is interesting, and somehow weird :) Thanks ! Mridul On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote: I think if you do COUNT(A), Pig will not realize it can ignore a2 and a3, and project all of them. On Wed, Aug 25, 2010 at 4:31 PM, Mridul

Re: Group By data

2010-08-25 Thread Mridul Muralidharan
One possibility might be some bug in the use of the combiner. You could try disabling it and seeing if it works ... Regards, Mridul On Wednesday 25 August 2010 01:42 AM, Wasti, Syed wrote: Hi, I have a very simple script and seeing a very strange behavior, getting wrong results when running this scr
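Disabling the combiner for such a test can be done from the script itself (property name per the Pig documentation; treat this as a sketch):

```pig
-- Turn the combiner off to check whether it is the source of the wrong
-- results, then re-enable it if the output becomes correct.
SET pig.exec.nocombiner true;
```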

Re: COUNT(A.field1)

2010-08-25 Thread Mridul Muralidharan
I am not sure why the second option is better - in both cases, you are shipping only the combined counts from map to reduce. On the other hand, the first could be better since it means we need to project only 'a1' - and none of the other fields. Or did I miss something here ? I am not very familiar to wh

[jira] Commented: (PIG-1321) Logical Optimizer: Merge cascading foreach

2010-08-25 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902392#action_12902392 ] Mridul Muralidharan commented on PIG-1321: -- Is the merge prevented only if fla

[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-25 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902350#action_12902350 ] Mridul Muralidharan commented on PIG-1518: -- Might be a good idea to con

Re: ORDER Issue (repost to avoid spam filters)

2010-08-19 Thread Mridul Muralidharan
Are you using pig local mode ? If yes, does this work with hadoop ? Regards, Mridul On Friday 20 August 2010 12:05 AM, Matthew Smith wrote: All, I am running pig-0.7.0 and I have been running into an issue running the ORDER command. I have attempted to run pig out of the box on 2 separate L

[jira] Commented: (PIG-1518) multi file input format for loaders

2010-08-18 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1292#action_1292 ] Mridul Muralidharan commented on PIG-1518: -- if optimizer is turned off, does

[jira] Commented: (PIG-365) Map side optimization for Limit (top k case)

2010-08-18 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1280#action_1280 ] Mridul Muralidharan commented on PIG-365: - collecting only top k per mappe

Re: Adding entries to classpath

2010-08-12 Thread Mridul Muralidharan
lasses you create via reflection. Regards, Mridul Right now, my workaround is fairly robust but ugly - I am adding the top-level jar to HADOOP_CLASSPATH. That jar lists a.jar, b.jar, ... in the list of files in Class-Path in META-INF/MANIFEST.MF. -sanjay -Original Message- From: Mridu

Re: Adding entries to classpath

2010-08-12 Thread Mridul Muralidharan
A short term alternative would be to find out the order in which pig expands the jars, and ensure that your jars are expanded in reverse order. As in, if you need your classpath to be "a.jar:b.jar:c.jar", and pig un-jars the REGISTERed jars in the order they are specified in the script, the

Re: question about making a UDF? (javax.media.jai.JAI) (java advanced imaging)

2010-08-10 Thread Mridul Muralidharan
You need the media framework, and would need to register those jars too for pig to 'find' the relevant classes : looks like they might not be part of plain jdk ? Regards, Mridul On Wednesday 11 August 2010 03:35 AM, Ifeanyichukwu Osuji wrote: The UDF i am making uses JAI from the javax.medi

Re: UDF. Change outputSchema from Evaluate

2010-08-06 Thread Mridul Muralidharan
If I understood your problem right, you can use define to pass parameters to the constructor and then use that (after populating it into an instance field). -- note, only Strings are accepted as parameters ! define MY_UDF org.me.udfp.MyUDF('param1', 'param2'); --- This will call the constructor

[jira] Commented: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup

2010-08-01 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894388#action_12894388 ] Mridul Muralidharan commented on PIG-1530: -- Can't edit comments .. to ad

[jira] Commented: (PIG-1530) PIG Logical Optimization: Push LOFilter above LOCogroup

2010-08-01 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-1530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12894368#action_12894368 ] Mridul Muralidharan commented on PIG-1530: -- This looks more like a developer co

Re: Pig Data types issue

2010-07-30 Thread Mridul Muralidharan
D '$inputPath' using my_custom_loader(); describe raw; -- Regards, -Rohini -----Original Message- From: Mridul Muralidharan [mailto:mrid...@yahoo-inc.com] Sent: Friday, July 30, 2010 4:24 AM To: pig-user@hadoop.apache.org Cc: Uppuluri, Rohini Subject: Re: Pig Data types

Re: Pig Data types issue

2010-07-29 Thread Mridul Muralidharan
Are you returning the appropriate schema and using the correct schema in the pig script ? More info might help though ! Regards, Mridul On Thursday 29 July 2010 09:16 PM, Uppuluri, Rohini wrote: Hi all, I have a strange issue with data types. We have a custom loader which loads data from lo

Re: Group by is not working with Filter

2010-07-29 Thread Mridul Muralidharan
On Thursday 29 July 2010 01:18 AM, Swati Jain wrote: Hello Everyone, I am trying to execute below mentioned script, but it is throwing error. Script is: A = load 'ex_groupby' USING PigStorage(',') as (a1:int,a2:int,a3:int); G1 = GROUP A by (a1,a2); describe G1; *D = Filter G1 by group.$0> 1;*

Re: best way for pig and mapreduce jobs to be used interchangeably

2010-07-28 Thread Mridul Muralidharan
ay 28 July 2010 08:55 PM, Corbin Hoenes wrote: Mridul - What file format do you use to exchange data between pig and java? Text or something else? On Jul 25, 2010, at 1:52 PM, Mridul Muralidharan wrote: In some of our pipelines, pig jobs are part of the pipeline - which consist of other h

Re: image processing on a low level using PIG...Possible?

2010-07-26 Thread Mridul Muralidharan
Hi, We have a few projects which do this on hadoop, but I don't see any reason why it can't be done in pig. As Alan and Ashutosh mentioned, the image itself will be just a bytearray (and so you need your own loader, or in our case use a sequence file loader) : but you can extract and pop

Re: best way for pig and mapreduce jobs to be used interchangeably

2010-07-25 Thread Mridul Muralidharan
In some of our pipelines, pig jobs are part of the pipeline - which consist of other hadoop jobs, shell executions, etc. We currently do this by using intermediate file dumps. Regards, Mridul On Friday 23 July 2010 10:45 PM, Corbin Hoenes wrote: What are some strategies to have pig and j

Re: Any better way to ensure unicity ?

2010-07-15 Thread Mridul Muralidharan
chararray,start: long} modified schema: sessions: {first::sid: chararray,first::infoid: chararray,first::imei: chararray,first::start: long} Do you know a workaround ? Le 13/07/10 10:13, Mridul Muralidharan a écrit : The flatten will return the same schema as before (in 'first') : so u

Re: Any better way to ensure unicity ?

2010-07-13 Thread Mridul Muralidharan
PM, Vincent Barat wrote: Yes. I would have used DISTINCT too, but I cannot, since some of the other fields can be different (the timestamp actually). Thanks for your help. Le 13/07/10 11:06, Mridul Muralidharan a écrit : I am not sure why the prefix 'first' is coming in ... someon

Re: Any better way to ensure unicity ?

2010-07-13 Thread Mridul Muralidharan
t::infoid: chararray,first::imei: chararray,first::start: long} Do you know a workaround ? Le 13/07/10 10:13, Mridul Muralidharan a écrit : The flatten will return the same schema as before (in 'first') : so unless you are modifying the fields or the order in which they are generated

Re: Any better way to ensure unicity ?

2010-07-13 Thread Mridul Muralidharan
xactly same as start of the code snippet for 'sessions'. Regards, Mridul On Tuesday 13 July 2010 01:01 PM, Vincent Barat wrote: Le 12/07/10 16:56, Mridul Muralidharan a écrit : I am not sure what you mean here exactly. Will a sid row have multiple (different) values for the oth

Re: Any better way to ensure unicity ?

2010-07-12 Thread Mridul Muralidharan
I am not sure what you mean here exactly. Will a sid row have multiple (different) values for the other fields ? If not - that is, if you simply have duplicate rows - you can use DISTINCT to achieve what you require : sessions = DISTINCT sessions PARALLEL $PARALLELISM; But if you wan
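One hedged sketch of the 'first::' prefix workaround discussed in this thread: re-project the flattened fields with explicit AS names. The field names follow the schema quoted above, but the exact prefix depends on the nested alias in the script:

```pig
-- After FLATTEN of a nested bag, field names carry a prefix such as
-- 'one::'; an extra FOREACH with AS clauses restores the plain names.
A = LOAD 'sessions' AS (sid:chararray, infoid:chararray,
                        imei:chararray, start:long);
G = GROUP A BY sid;
first = FOREACH G {
    ordered = ORDER A BY start;
    one = LIMIT ordered 1;
    GENERATE FLATTEN(one);
};
renamed = FOREACH first GENERATE one::sid AS sid, one::infoid AS infoid,
    one::imei AS imei, one::start AS start;
```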

Re: UDF and rdbms lookups

2010-07-07 Thread Mridul Muralidharan
You will need to look at lifecycle of a udf to better understand this. Typically they are created (note: one or more creations !) during plan creation time (before job submission) and subsequently deserialized on the various mapper/reducer nodes to get executed (iirc). So typically what I ha

Re: Pig at LinkedIn

2010-06-24 Thread Mridul Muralidharan
As an aside, if you are using Azkaban for purpose of cron, etc - you might want to take a look at oozie : I think it has been released - and iirc going to be opensourced too. Regards, Mridul On Friday 25 June 2010 12:21 AM, Russell Jurney wrote: Wrote a... thing about Pig at LinkedIn that

Re: including multiple delimited fields (of unknown count) into one

2010-05-20 Thread Mridul Muralidharan
ally insert the field definitions in my script before I run it. So in the example above I would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another run might insert 'f1, f2' for an input that only has 2 extra fields. On Thu, May 20, 2010 at 12:39 AM, Mridu

Re: including multiple delimited fields (of unknown count) into one

2010-05-20 Thread Mridul Muralidharan
uld I access the items in the numbered fields 3..N where I don't know what N is? Are you suggesting I pass A to a custom UDF to convert to a tuple of [time, count, rest_of_line]? On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan mailto:mrid...@yahoo-inc.com>> wrote: You can simply

Re: including multiple delimited fields (of unknown count) into one

2010-05-19 Thread Mridul Muralidharan
You can simply skip specifying schema in the load - and access the fields either through the udf or through $0, etc positional indexes. Like : A = load 'myfile' USING PigStorage(); B = GROUP A by round_hour($0) PARALLEL $PARALLELISM; C = ... Regards, Mridul On Thursday 20 May 2010 04:07

distcp of small number of really large files

2010-05-17 Thread Mridul Muralidharan
Hi, Is there a way to parallelize the copy of really large files ? From my understanding, currently each map in distcp copies one file. So for really large files, this would be pretty slow if the number of files is small. Thanks, Mridul

[jira] Commented: (PIG-566) Dump and store outputs do not match for PigStorage

2010-05-17 Thread Mridul Muralidharan (JIRA)
[ https://issues.apache.org/jira/browse/PIG-566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868155#action_12868155 ] Mridul Muralidharan commented on PIG-566: - Just to point out an error in the com

Re: SpillableMemoryManager - low memory handler called

2010-05-10 Thread Mridul Muralidharan
ing is actually spilled. This gets printed out even if there are no spillable objects the Manager is aware of. An 8G map will certainly trigger the GC. On Fri, May 7, 2010 at 2:44 PM, Mridul Muralidharan wrote: Hi, Do you know which snippet in the script is causing the issue ? There are m

Re: SpillableMemoryManager - low memory handler called

2010-05-07 Thread Mridul Muralidharan
Hi, Do you know which snippet in the script is causing the issue ? There are multiple MR jobs which will be executed, what is causing the exact issue ? Map side spills is strange - are you sure it is not in the reducer ? If it really is in the map side, I guess it is pointing to the case

Re: short circuiting the pig ? operator

2010-04-28 Thread Mridul Muralidharan
I am not very sure what all the runtime implications of some of the pig idioms are (and I have a feeling it changes with the impl) ... including nested foreach. For example : B = foreach A { X0 = ... X = .. work on X0 ...; GENERATE X, udf1(X), udf2(X); } will cause X0/X to be evaluated multi
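The nested-foreach pattern in question, with real built-ins standing in for udf1/udf2 — a sketch; scalar aliases inside the nested block are only allowed in newer Pig versions, and whether X is evaluated once or once per reference depends on the implementation, as noted above:

```pig
A = LOAD 'input' AS (s:chararray);
B = FOREACH A {
    X0 = UPPER(s);
    X  = CONCAT(X0, '!');
    -- X is referenced three times; it may be recomputed per reference.
    GENERATE X, SIZE(X), SUBSTRING(X, 0, 1);
};
```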

Re: cross join

2010-04-28 Thread Mridul Muralidharan
CROSS is not a join, it is simply a cartesian product. Where did you see cross join ? Maybe I am missing something ... Regards, Mridul On Wednesday 28 April 2010 07:51 AM, hc busy wrote: guys, I'm looking at the doc's for CROSS join and noticed that it's not really a cross join, more rather jus

Re: result of OUTER JOIN

2010-04-22 Thread Mridul Muralidharan
Hi Alex, This is a bug in pig imo where it is pushing the filter before the join : when it should not. To validate, simply introduce an intermediate store/load pair to see right results. There probably already is some JIRA similar to this, if yes - please do add to that or please do crea

Re: Bug in FILTER with IS (NOT) NULL ?

2010-04-21 Thread Mridul Muralidharan
r issues). --- A = load 'input' AS (src:chararray, tgt:chararray, sc1:int); B = GROUP A by src PARALLEL $PARALLELISM; C = FILTER B by NOT IsEmpty($1) ; dump C --- Sorry for the confusion Regards, Mridul On Thursday 22 April 2010 01:17 AM, Mridul Muralidharan wrote: Hi, Ju

Re: Bug in FILTER with IS (NOT) NULL ?

2010-04-21 Thread Mridul Muralidharan
this, there is a pig construct iirc "$2 IS NOT NULL" works, you don't need the udf for that ... -- and T = filter U by my.udf.NOT(IsEmpty($3)); "IsEmpty($3) != false" ? or "IsEmpty($3) != true" can replace NOT udf ? Regards, Mridul it was for an older ver of pig

Re: Bug in FILTER with IS (NOT) NULL ?

2010-04-21 Thread Mridul Muralidharan
In case of co-group, if nothing matched the group key, you get an empty bag, not null. So checking for COUNT(alias) == 0 is what you need. Regards, Mridul On Wednesday 21 April 2010 03:37 PM, Alexander Schätzle wrote: Hello, I want to use IS NULL in a FILTER but the behavior seems to be
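The empty-bag check described above, sketched for a COGROUP (schemas assumed):

```pig
-- An unmatched group key yields an empty bag on that side, not null,
-- so COUNT(...) == 0 is the right emptiness test.
A = LOAD 'a' AS (k:chararray, v:int);
B = LOAD 'b' AS (k:chararray, w:int);
C = COGROUP A BY k, B BY k;
only_in_a = FILTER C BY COUNT(B) == 0;
```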

Re: InputSplit in UDF

2010-03-30 Thread Mridul Muralidharan
You might want to be careful with this ... the udf could get used in both map & reduce side, no ? Regards, Mridul On Wednesday 31 March 2010 02:22 AM, Sandesh Devaraju wrote: Hi All, Is there a way to get current InputSplit in a UDF (more specifically, a filter function)? I have a filter f

Re: more bagging fun

2010-03-09 Thread Mridul Muralidharan
On Tuesday 09 March 2010 04:13 AM, hc busy wrote: okay. Here's the bag that I have: {group: (a: int,b: chararray,c: chararray,d: int), TABLE: {number1: int, number2:int}} and I want to do this grunt> CALCULATE= FOREACH TABLE_group GENERATE group, SUM(TABLE.number1 / TABLE.number2); TAB

Re: COUNT(null bag)

2010-03-09 Thread Mridul Muralidharan
On Saturday 06 March 2010 04:35 AM, hc busy wrote: Guys, I have some data that has null bag. Looking at the COUNT.java it seems that it is an error condition for the bag passed in to be null (instead of zero for example.) I tried to change it to an empty bag when it's null data = FOREACH input

Re: Reducers slowing down? (UNCLASSIFIED)

2010-03-05 Thread Mridul Muralidharan
On Saturday 06 March 2010 04:47 AM, Thejas Nair wrote: I am not sure why the rate at which output is generated is slowing down. But cross in pig is not optimized ­ it uses only one reducer. (a major limitation if you are trying to process lots of data with a large cluster!) CROSS is not suppos

Re: computation inside foreach generate block.

2010-03-01 Thread Mridul Muralidharan
Within the same map or reduce step - as subsequent operators. I don't think it combines operators, but it is not different jobs - if that is what you are worried about. Think of it like a pipeline ... Regards, Mridul On Monday 01 March 2010 05:32 PM, prasenjit mukherjee wrote: Thanks that will w

Re: Loading multiple types.

2010-03-01 Thread Mridul Muralidharan
e separate delimiters for fields,bags ? Basically the content of my file should now be : a b c {(15,good),(24,total),(9,bad)} a b d {(2,bad),(6,good),(8,total)} -Prasen On Mon, Mar 1, 2010 at 2:23 AM, Mridul Muralidharan wrote: Your schema is essentially : (stri

Re: filter/join by sql like "%pattern" condition

2010-02-28 Thread Mridul Muralidharan
Slightly digressing and possibly rambling - feel free to ignore ! Making it a general problem when both lists are 'large' (too large to fit into memory). A general solution for this, when the list of blacklist emails is itself large, is an interesting problem. Probably something which might benefit from the

Re: Filter Inside Nested Foreach

2010-02-28 Thread Mridul Muralidharan
Just curious, what was the actual error with using filters within a nested foreach ? Will it be possible to show the snippet ? (and the schema of the input ?). We are using this without issue right now, so curious what the problem here is .. Thanks, Mridul On Friday 26 February 2010 03:17 AM, zaki

Re: python based UDFs

2010-02-28 Thread Mridul Muralidharan
You can get in touch with Arnab if you want more info on it ... I am sure he will be very much interested to see others using it :-) Regards, Mridul On Friday 26 February 2010 08:43 AM, prasenjit mukherjee wrote: Any thoughts on including python-based UDFs like the following : http://arnab

Re: Loading multiple types.

2010-02-28 Thread Mridul Muralidharan
Note, as should be obvious, the new file will have the delimiter '\t' and not ','. To give us : r1 = load '/tmp/prasen/foo1.txt_new' using PigStorage('\t') AS (f1:chararray, f2:chararray,f3:chararray, B:{T1:(i1:int,s1:chararray)}); Regards, Mridul

Re: Loading multiple types.

2010-02-28 Thread Mridul Muralidharan
Your schema is essentially : (string, string, string, bag). With bag containing tuples with schema (number, string). Based on this, the schema should be what you described second - namely : r1 = load '/tmp/prasen/foo1.txt' using PigStorage(',') AS (f1:chararray, f2:chararray,f3:chararray, B

Re: wait ( or thread.join() ) in pig ?

2010-02-16 Thread Mridul Muralidharan
Is this documented behavior or current impl detail ? A lot of scripts broke when multi-query optimization was committed to trunk because of the implicit ordering assumption (based on STORE) in earlier pig - which was, iirc, documented. Regards, Mridul On Thursday 11 February 2010 10:52 PM,

Re: Pig 0.6 average (AVG) question

2010-02-09 Thread Mridul Muralidharan
e group by, even if it's only null values. I just wandered if theres anything to be done about the NPE to make it more clear, that's all. I guess you can see this as an eventual feature / improvement of some sort, no problems :) alex On Tue, Feb 9, 2010 at 11:35 AM, Mridul Mura

Re: Pig 0.6 average (AVG) question

2010-02-09 Thread Mridul Muralidharan
On second thought, probably A itself is NULL - in which case you will need a null check on A, and not on A.v (which, I think, is handled iirc). Regards, Mridul On Tuesday 09 February 2010 04:02 PM, Mridul Muralidharan wrote: Without knowing rest of the script, you could do something like

Re: Pig 0.6 average (AVG) question

2010-02-09 Thread Mridul Muralidharan
Without knowing rest of the script, you could do something like : C = FOREACH B { X = FILTER A BY v IS NOT NULL; GENERATE group, (int)AVG(X) as statsavg; }; I am assuming it is cos there are nulls in your bag field. Regards, Mridul On Tuesday 09 February 2010 03:52 PM, Alex Parvulescu wr

Re: setNumReduceTasks(1)

2010-01-30 Thread Mridul Muralidharan
't want to write any random N rows to the table. I want to write the *top* N rows - meaning - I want to write the "key" values of the Reducer in descending order. Does this make sense? Sorry for the confusion. On Wed, Jan 27, 2010 at 11:09 PM, Mridul Muralidharan < mrid..

Re: How to write an UDF to pass Two parameters to a UDF Filter function.........

2010-01-29 Thread Mridul Muralidharan
There is an error in the basic script - which I propagated in my copy paste - corrected below. Regards, Mridul Mridul Muralidharan wrote: There are two ways to handle this. You can pass it along as a parameter as you did in the script - though note that, in your udf, it will be a tuple

Re: How to write an UDF to pass Two parameters to a UDF Filter function.........

2010-01-29 Thread Mridul Muralidharan
There are two ways to handle this. You can pass it along as a parameter as you did in the script - though note that, in your udf, it will be a tuple with first field == category, second field == "110". public Boolean exec(Tuple _input) throws IOException { String input = (String)_input.

Re: setNumReduceTasks(1)

2010-01-27 Thread Mridul Muralidharan
A possible solution is to emit only N rows from each mapper and then use 1 reduce task [*] - if the value of N is not very high. So you end up with at most m * N rows on the reducer instead of the full input set - and so the limit can be done easier. If you are ok with some sort of variance in the number of r
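In Pig terms, the top-N idea above is essentially what ORDER followed by LIMIT does (N = 100 is an assumption; how aggressively Pig pushes the limit toward the map side is version-dependent, cf. the PIG-365 discussion elsewhere in this archive):

```pig
-- Each mapper only needs to contribute a bounded number of rows to the
-- final single-reducer limit stage.
A = LOAD 'input' AS (k:chararray, score:int);
S = ORDER A BY score DESC;
T = LIMIT S 100;
```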

Re: setNumReduceTasks(1)

2010-01-27 Thread Mridul Muralidharan
Jeff Zhang wrote: *See my comments below* On Mon, Jan 25, 2010 at 3:22 PM, Something Something < mailinglist...@gmail.com> wrote: If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class be instantiated only on one machine.. always? I mean if I have a cluster of say 1 maste

Re: setNumReduceTasks(1)

2010-01-27 Thread Mridul Muralidharan
On Tue, Jan 26, 2010 at 3:08 PM, Mridul Muralidharan wrote: Jeff Zhang wrote: *See my comments below* On Mon, Jan 25, 2010 at 3:22 PM, Something Something < mailinglist...@gmail.com> wrote: If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class be instantiat

Re: setNumReduceTasks(1)

2010-01-25 Thread Mridul Muralidharan
Jeff Zhang wrote: *See my comments below* On Mon, Jan 25, 2010 at 3:22 PM, Something Something < mailinglist...@gmail.com> wrote: If I set # of reduce tasks to 1 using setNumReduceTasks(1), would the class be instantiated only on one machine.. always? I mean if I have a cluster of say 1 maste

Re: enforcing number of mappers

2010-01-24 Thread Mridul Muralidharan
If each line from your file has to be processed by a different mapper - other than by writing a custom slicer, a very dirty hack would be to : a) create N files with one line each. b) Or, do something like : input_lines = load 'my_s3_list_file' as (location_line:chararray); grp_op = G
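Hack (b) above, completed as a hedged sketch (the parallelism value and the processing UDF are assumptions, not from the thread):

```pig
-- Group each line by itself so distinct lines can be spread across
-- reduce tasks; myudfs.Process is a hypothetical per-line UDF.
input_lines = LOAD 'my_s3_list_file' AS (location_line:chararray);
grp_op = GROUP input_lines BY location_line PARALLEL 50;
processed = FOREACH grp_op GENERATE myudfs.Process(group);
```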

Re: Initial Benchmark Results

2010-01-19 Thread Mridul Muralidharan
The only other suggestion I can make, other than what has already been mentioned by others, is to parameterize the PARALLEL value - so that you use the optimal number of reducers for the test (depending on the cluster size and the number of reducers per node). Regards, Mridul Rob Stewart wrote
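Parameterizing PARALLEL as suggested above (run with e.g. `pig -p PARALLELISM=20 script.pig`; relation and field names are illustrative):

```pig
A = LOAD 'input' AS (k:chararray, v:int);
-- $PARALLELISM is substituted at launch, so the reducer count can be
-- tuned per cluster without editing the script.
G = GROUP A BY k PARALLEL $PARALLELISM;
```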

Re: Secondary indexes and transactions

2010-01-19 Thread Mridul Muralidharan
y much for digging in here, a second set of eyes is handy. -clint On Tue, Jan 19, 2010 at 1:37 AM, Mridul Muralidharan wrote: Clint Morgan wrote: After the 2PC process has determined that a commit should happen there is no roll-back. The commit must be processe

Re: Secondary indexes and transactions

2010-01-19 Thread Mridul Muralidharan
it failure in a indexed regionserver does a rollback of the txn, then the issue I mentioned can occur ? Thanks for your patience and time ! Regards, Mridul -clint On Fri, Jan 15, 2010 at 2:43 AM, Mridul Muralidharan wrote: I think I might not have explained it well enough. As part of execu

Re: Piglet: a Ruby DSL for writing Pig scripts

2010-01-15 Thread Mridul Muralidharan
risen =) I should probably start a Google group or something. T# On Fri, Jan 15, 2010 at 11:56 AM, Mridul Muralidharan wrote: This looks really promising Theo ! Is there some mailing list where discussions & queries related to piglet are discussed ? Thanks, Mridul Theo Hultberg wrote: Hi,

Re: Piglet: a Ruby DSL for writing Pig scripts

2010-01-15 Thread Mridul Muralidharan
This looks really promising Theo ! Is there some mailing list where discussions & queries related to piglet are discussed ? Thanks, Mridul Theo Hultberg wrote: Hi, I've written a Ruby DSL for writing Pig scripts, which I hope might interest some of you. It makes it possible to do a lot of

Re: Secondary indexes and transactions

2010-01-15 Thread Mridul Muralidharan
int On Sun, Jan 3, 2010 at 4:46 PM, Mridul Muralidharan wrote: stack wrote: On Sun, Jan 3, 2010 at 10:46 AM, Mridul Muralidharan wrote: I was wondering about the atomicity guarantees when using secondary indexes from within a transaction. You are talking about indexed hbase from transact

Re: Conditional Selects

2010-01-12 Thread Mridul Muralidharan
As a follow-up to what Dmitriy described: just add a project to pick the columns you need. c = join a by filename, b by filename PARALLEL $MY_PARALLELISM; --- please check this syntax against the Pig Latin docs. d = foreach c generate a::filename; --- or anything else you want to pick. if you ne
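Laid out as separate statements, the flattened snippet above reads as the sketch below (relation names `a` and `b` and the `filename` field are taken from the thread; the rest is illustrative):

```pig
-- Join the two relations on filename; reducer count supplied as a parameter.
c = JOIN a BY filename, b BY filename PARALLEL $MY_PARALLELISM;

-- After a join, field names are qualified; use the :: prefix to disambiguate
-- which relation's column you want to keep.
d = FOREACH c GENERATE a::filename;
```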

Re: Analyzing MySQL slow query logs using Pig + Hadoop

2010-01-12 Thread Mridul Muralidharan
Chris Hartjes wrote: My apologies if this is the wrong mailing list to ask this question. I've started playing around with Pig and Hadoop, with the intention of using it to do some analysis of a collection of MySQL slow query log files. I am not a Java programmer (been using PHP for a very long

Re: [BOSH] Pipelining / avoiding use of 2x HTTP-sockets

2010-01-11 Thread Mridul Muralidharan
To add to what Mathew clarified - you will need to send that empty request when server has responded to all your requests. This happens typically when : a) the request was held at CM for the max configured time. b) CM/server had something to send to client. Regards, Mridul --- On Mon, 11/1/10,

ChainMapper with MultipleInputs

2010-01-11 Thread Mridul Muralidharan
Hi, Is there a way to specify chained mappers with multiple inputs? Essentially building a pipeline of mappers based on the different inputs involved? For something like: current_data (seq_files) -> process -> emit key, value. new_data (text_files) -> sanitize -> preprocess -> proce

Re: MR in HBase

2010-01-10 Thread Mridul Muralidharan
12:26 AM, Mridul Muralidharan wrote: Hi, This is assuming there is no easier way to do it (someone from hbase team can comment better !). But the usual way to handle this for mapreduce is to create a composite input format : which delegates to the underlying formats to generate the splits, and

Re: [BOSH] Pipelining / avoiding use of 2x HTTP-sockets

2010-01-09 Thread Mridul Muralidharan
like same thing happen there too. > > Since this seemed like a relevant ongoing thread, i though > i would clear my point here. > Is this how it should be? > > Abhinav Singh, > Bangalore, > India > http://abhinavsingh.com/blog > > From: > Mridul Muralidharan &g

Re: [BOSH] Pipelining / avoiding use of 2x HTTP-sockets

2010-01-08 Thread Mridul Muralidharan
--- On Sat, 9/1/10, Peter Saint-Andre wrote: > From: Peter Saint-Andre > Subject: Re: [BOSH] Pipelining / avoiding use of 2x HTTP-sockets > To: "Bidirectional Streams Over Synchronous HTTP" > Date: Saturday, 9 January, 2010, 1:50 AM > On 12/30/09 8:47 AM, Mr

FW: Read op question

2010-01-08 Thread Mridul Muralidharan
A colleague is unable to send this mail to the list, so I am proxying it. Thanks in advance for the responses! Regards, Mridul --- Hi, I'm trying to better understand the flow of the client read operation in HBase. I've been looking at a combination of the HBase documents, Lars George's summar

Re: Adding new region servers without restart

2010-01-08 Thread Mridul Muralidharan
clarifying ! Regards, Mridul Jean-Daniel Cryans wrote: Use the commands described here: http://wiki.apache.org/hadoop/Hbase/RollingRestart J-D On Fri, Jan 8, 2010 at 11:49 AM, Mridul Muralidharan wrote: Hi, Suppose I want to add a new region server to my instance. I imagine I need to add it to the

Adding new region servers without restart

2010-01-08 Thread Mridul Muralidharan
Hi, Suppose I want to add a new region server to my instance. I imagine I need to add it to the list in the conf files for HBase and Hadoop, and then stop/start the cluster. Is there any way to add the server without stopping the system? Thanks, Mridul
