Re: pyspark groupByKey throwing error: unpack requires a string argument of length 4
Thanks Davies - sure, I can share the code/data in a PM.

best
fahad

On Mon, Oct 19, 2015 at 10:52 AM, Davies Liu wrote:
> Could you simplify the code a little bit so we can reproduce the failure?
> (maybe also include a small sample dataset if it depends on the data)
>
> On Sun, Oct 18, 2015 at 10:42 PM, fahad shah wrote:
>> Hi
>>
>> I am trying to build pair RDDs, group by the key, and assign an ID
>> based on the key. I am using PySpark with Spark 1.3; for some reason
>> I am getting an error that I am unable to figure out - any help much
>> appreciated.
>>
>> Things I tried (but to no effect):
>>
>> 1. made sure I am not doing any conversions on the strings
>> 2. made sure that the fields used in the key are all present and not
>>    empty strings (otherwise I toss the row out)
>>
>> My code is along the following lines (split uses StringIO to parse the
>> CSV, isHeader drops the header row, and parse_train puts the 54 fields
>> into a named tuple after whitespace/quote removal):
>>
>> # The "string argument" error is thrown on BB.take(1), where the
>> # groupByKey is evaluated
>>
>> A = sc.textFile("train.csv") \
>>       .filter(lambda x: not isHeader(x)) \
>>       .map(split) \
>>       .map(parse_train) \
>>       .filter(lambda x: x is not None)
>>
>> A.count()
>>
>> B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
>>                       k.srch_booking_window, k.srch_adults_count,
>>                       k.srch_children_count, k.srch_room_count),
>>                      k[0:54]))
>> BB = B.groupByKey()
>> BB.take(1)
>>
>> best fahad
Re: best way to generate per-key auto-increment numerals after sorting
Thanks Davies. groupByKey was throwing the error:

    unpack requires a string argument of length 4

Interestingly, I replaced it with sortByKey (which I read also shuffles, so that data for the same key land on the same partition) and it ran fine - wondering if this is a bug in groupByKey for Spark 1.3?

best
fahad

On Mon, Oct 19, 2015 at 10:45 AM, Davies Liu wrote:
> What's the issue with groupByKey()?
>
> On Mon, Oct 19, 2015 at 1:11 AM, fahad shah wrote:
>> Hi
>>
>> I wanted to ask: what's the best way to achieve per-key auto-increment
>> numerals after sorting? For example:
>>
>> raw file:
>>
>> 1,a,b,c,1,1
>> 1,a,b,d,0,0
>> 1,a,b,e,1,0
>> 2,a,e,c,0,0
>> 2,a,f,d,1,0
>>
>> desired output (the last column is the position number after grouping
>> on the first three fields and reverse-sorting on the last two values):
>>
>> 1,a,b,c,1,1,1
>> 1,a,b,d,0,0,3
>> 1,a,b,e,1,0,2
>> 2,a,e,c,0,0,2
>> 2,a,f,d,1,0,1
>>
>> I am using a solution based on groupByKey, but it is running into some
>> issues (possibly a bug in PySpark/Spark?) - wondering if there is a
>> better way to achieve this.
>>
>> My solution:
>>
>> A = sc.textFile("train.csv") \
>>       .filter(lambda x: not isHeader(x)) \
>>       .map(split) \
>>       .map(parse_train) \
>>       .filter(lambda x: x is not None)
>>
>> B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
>>                      k[0:5])).groupByKey()
>>
>> B.map(sort_n_set_position).flatMap(lambda line: line)
>>
>> where sort_n_set_position iterates over each group's records, sorts
>> them, and appends the position as the last column.
>>
>> best fahad
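For the archive, here is a sketch of how the sort-based workaround can be made robust; this is not the poster's exact code. Note that plain sortByKey() range-partitions on the full key, so one group's rows can in principle straddle a partition boundary when their sort fields differ; partitioning on the grouping fields only, via repartitionAndSortWithinPartitions (available in PySpark as of 1.3), avoids that. The field names, record layout, and int() casts are assumptions carried over from the poster's messages:

from pyspark.rdd import portable_hash

def number_within_key(records):
    # Walk one partition's sorted (key, record) pairs, restarting the
    # counter whenever the grouping fields (the first three) change.
    prev_group, pos = None, 0
    for key, rec in records:
        group = key[:3]
        pos = pos + 1 if group == prev_group else 1
        prev_group = group
        yield tuple(rec) + (pos,)

# Fold the two sort columns into the key so the shuffle orders them.
pairs = A.map(lambda k: ((k.first_field, k.second_field, k.third_field,
                          int(k[-2]), int(k[-1])), k[0:5]))

numbered = pairs.repartitionAndSortWithinPartitions(
    partitionFunc=lambda key: portable_hash(key[:3]),  # keep each group together
    ascending=False                                    # reverse sort within partitions
).mapPartitions(number_within_key)

This avoids materializing whole groups in memory the way groupByKey does, at the cost of a single sorted shuffle.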
best way to generate per-key auto-increment numerals after sorting
Hi

I wanted to ask: what's the best way to achieve per-key auto-increment numerals after sorting? For example:

raw file:

1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0

desired output (the last column is the position number after grouping on the first three fields and reverse-sorting on the last two values):

1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1

I am using a solution based on groupByKey, but it is running into some issues (possibly a bug in PySpark/Spark?) - wondering if there is a better way to achieve this.

My solution:

A = sc.textFile("train.csv") \
    .filter(lambda x: not isHeader(x)) \
    .map(split) \
    .map(parse_train) \
    .filter(lambda x: x is not None)

B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
                     k[0:5])).groupByKey()

B.map(sort_n_set_position).flatMap(lambda line: line)

where sort_n_set_position iterates over each group's records, sorts them, and appends the position as the last column.

best fahad
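For reference, a minimal sketch of what sort_n_set_position could look like, given the description above. The record layout (the last two fields drive the reverse sort) and the int() casts are assumptions, not the poster's actual code:

def sort_n_set_position(key_group):
    # key_group is one (key, records) pair produced by groupByKey().
    key, records = key_group
    # Reverse-sort the group's records on their last two fields,
    # cast to int so that e.g. "10" sorts after "9".
    ordered = sorted(records,
                     key=lambda r: (int(r[-2]), int(r[-1])),
                     reverse=True)
    # Append the 1-based position as the new last column.
    return [tuple(rec) + (pos,) for pos, rec in enumerate(ordered, 1)]

With this definition, B.map(sort_n_set_position).flatMap(lambda line: line) yields the desired output above: one record per input row, with its within-group position appended.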
Re: pyspark groupByKey throwing error: unpack requires a string argument of length 4
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

On Sun, Oct 18, 2015 at 11:17 PM, Jeff Zhang wrote:
> Stacktrace would be helpful if you can provide that.
>
> On Mon, Oct 19, 2015 at 1:42 PM, fahad shah wrote:
>>
>> Hi
>>
>> I am trying to build pair RDDs, group by the key, and assign an ID
>> based on the key. I am using PySpark with Spark 1.3; for some reason
>> I am getting an error that I am unable to figure out - any help much
>> appreciated.
>>
>> Things I tried (but to no effect):
>>
>> 1. made sure I am not doing any conversions on the strings
>> 2. made sure that the fields used in the key are all present and not
>>    empty strings (otherwise I toss the row out)
>>
>> My code is along the following lines (split uses StringIO to parse the
>> CSV, isHeader drops the header row, and parse_train puts the 54 fields
>> into a named tuple after whitespace/quote removal):
>>
>> # The "string argument" error is thrown on BB.take(1), where the
>> # groupByKey is evaluated
>>
>> A = sc.textFile("train.csv") \
>>       .filter(lambda x: not isHeader(x)) \
>>       .map(split) \
>>       .map(parse_train) \
>>       .filter(lambda x: x is not None)
>>
>> A.count()
>>
>> B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
>>                       k.srch_booking_window, k.srch_adults_count,
>>                       k.srch_children_count, k.srch_room_count),
>>                      k[0:54]))
>> BB = B.groupByKey()
>> BB.take(1)
>>
>> best fahad
>
> --
> Best Regards
>
> Jeff Zhang
pyspark groupByKey throwing error: unpack requires a string argument of length 4
Hi

I am trying to build pair RDDs, group by the key, and assign an ID based on the key. I am using PySpark with Spark 1.3; for some reason I am getting an error that I am unable to figure out - any help much appreciated.

Things I tried (but to no effect):

1. made sure I am not doing any conversions on the strings
2. made sure that the fields used in the key are all present and not
   empty strings (otherwise I toss the row out)

My code is along the following lines (split uses StringIO to parse the CSV, isHeader drops the header row, and parse_train puts the 54 fields into a named tuple after whitespace/quote removal):

# The "string argument" error is thrown on BB.take(1), where the
# groupByKey is evaluated

A = sc.textFile("train.csv") \
    .filter(lambda x: not isHeader(x)) \
    .map(split) \
    .map(parse_train) \
    .filter(lambda x: x is not None)

A.count()

B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
                      k.srch_booking_window, k.srch_adults_count,
                      k.srch_children_count, k.srch_room_count),
                     k[0:54]))
BB = B.groupByKey()
BB.take(1)

best fahad
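For anyone trying to reproduce this, here is a minimal sketch of the three helpers the post describes but does not show. The header test, the cleaning rules, and the schema (sketched with just the six key columns rather than the full 54) are all assumptions, not the poster's actual code:

import csv
from StringIO import StringIO        # Python 2, matching Spark 1.3-era PySpark
from collections import namedtuple

# Assumed schema: only the six key columns; the real file has 54 fields.
FIELDS = ["srch_destination_id", "srch_length_of_stay", "srch_booking_window",
          "srch_adults_count", "srch_children_count", "srch_room_count"]
Row = namedtuple("Row", FIELDS)

def isHeader(line):
    # Assumption: the header row starts with the first column name.
    return line.startswith(FIELDS[0])

def split(line):
    # Parse one CSV line with the csv module so quoted commas survive.
    return next(csv.reader(StringIO(line)))

def parse_train(fields):
    # Strip whitespace/quotes; drop malformed rows and rows with empty fields.
    cleaned = [f.strip().strip('"') for f in fields]
    if len(cleaned) != len(FIELDS) or any(c == "" for c in cleaned):
        return None
    return Row(*cleaned)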