Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
Thanks Davies, sure, I can share the code/data in a PM - best fahad

On Mon, Oct 19, 2015 at 10:52 AM, Davies Liu  wrote:
> Could you simplify the code a little bit so we can reproduce the failure?
> (may also have some sample dataset if it depends on them)
>
> On Sun, Oct 18, 2015 at 10:42 PM, fahad shah  wrote:
>> Hi
>>
>> I am trying to build pair RDDs, group by the key, and assign an ID based on the key.
>> I am using PySpark with Spark 1.3, and for some reason I am getting this
>> error that I am unable to figure out - any help much appreciated.
>>
>> Things I tried (but to no effect):
>>
>> 1. make sure I am not doing any conversions on the strings
>> 2. make sure that the fields used in the key are all present and not
>> empty strings (or else I toss the row out)
>>
>> My code is along the following lines (split uses StringIO to parse the
>> CSV, isHeader filters out the header row, and parse_train puts the 54
>> fields into a named tuple after whitespace/quote removal):
>>
>> # Error for string argument is thrown on BB.take(1), where the groupByKey
>> # is evaluated.
>>
>> A = sc.textFile("train.csv") \
>>      .filter(lambda x: not isHeader(x)) \
>>      .map(split) \
>>      .map(parse_train) \
>>      .filter(lambda x: x is not None)
>>
>> A.count()
>>
>> B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
>>                       k.srch_booking_window, k.srch_adults_count,
>>                       k.srch_children_count, k.srch_room_count),
>>                      k[0:54]))
>> BB = B.groupByKey()
>> BB.take(1)
>>
>>
>> best fahad
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Thanks Davies,

groupByKey was throwing the error: unpack requires a string
argument of length 4

Interestingly, I replaced it with sortByKey (which I read also
shuffles so that data for the same key end up on the same partition)
and it ran fine - wondering if this is a bug in groupByKey for Spark 1.3?
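
For reference, a minimal, self-contained sketch of that substitution (made-up
data; assumes a SparkContext is already available as sc, as in the shell) - an
illustration only, not the actual job:

# Hypothetical pair RDD: (tuple key, record), mirroring the B RDD in the
# quoted message below.
pairs = sc.parallelize([
    ((1, "a", "b"), ("c", 1, 1)),
    ((1, "a", "b"), ("d", 0, 0)),
    ((2, "a", "e"), ("c", 0, 0)),
])

# Original approach: groupByKey shuffles all values for a key into a single
# iterable; evaluating it (e.g. via take(1)) is where the unpack error showed up.
grouped = pairs.groupByKey()

# Substitution described above: sortByKey range-partitions by key, so rows with
# the same key end up in the same partition in sorted order, and the per-key
# logic can then walk each partition instead of materialising grouped iterables.
sorted_pairs = pairs.sortByKey()
print(sorted_pairs.take(2))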

best fahad

On Mon, Oct 19, 2015 at 10:45 AM, Davies Liu  wrote:
> What's the issue with groupByKey()?
>
> On Mon, Oct 19, 2015 at 1:11 AM, fahad shah  wrote:
>> Hi
>>
>> I wanted to ask what's the best way to achieve per-key auto-increment
>> numerals after sorting, e.g.:
>>
>> raw file:
>>
>> 1,a,b,c,1,1
>> 1,a,b,d,0,0
>> 1,a,b,e,1,0
>> 2,a,e,c,0,0
>> 2,a,f,d,1,0
>>
>> expected output (the last column is the position number after grouping on the
>> first three fields and reverse-sorting on the last two values):
>>
>> 1,a,b,c,1,1,1
>> 1,a,b,d,0,0,3
>> 1,a,b,e,1,0,2
>> 2,a,e,c,0,0,2
>> 2,a,f,d,1,0,1
>>
>> I am using a solution based on groupByKey, but it is running into some
>> issues (possibly a bug in PySpark/Spark?), so I am wondering if there is a
>> better way to achieve this.
>>
>> My solution:
>>
>> A = sc.textFile("train.csv") \
>>      .filter(lambda x: not isHeader(x)) \
>>      .map(split) \
>>      .map(parse_train) \
>>      .filter(lambda x: x is not None)
>>
>> B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
>>                      k[0:5])).groupByKey()
>>
>> B.map(sort_n_set_position).flatMap(lambda line: line)
>>
>> where sort_n_set_position iterates over each group's values, sorts them, and
>> appends the position column.
>>
>> best fahad
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Hi

I wanted to ask what's the best way to achieve per-key auto-increment
numerals after sorting, e.g.:

raw file:

1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0

expected output (the last column is the position number after grouping on the
first three fields and reverse-sorting on the last two values):

1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1

I am using a solution based on groupByKey, but it is running into some
issues (possibly a bug in PySpark/Spark?), so I am wondering if there is a
better way to achieve this.

My solution:

A = sc.textFile("train.csv") \
     .filter(lambda x: not isHeader(x)) \
     .map(split) \
     .map(parse_train) \
     .filter(lambda x: x is not None)

B = A.map(lambda k: ((k.first_field, k.second_field, k.third_field),
                     k[0:5])).groupByKey()

B.map(sort_n_set_position).flatMap(lambda line: line)

where sort_n_set_position iterates over each group's values, sorts them, and
appends the position column.
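
A minimal sketch of such a function, for concreteness (the exact sort key and
row layout here are assumptions based on the example above, not my actual
implementation):

def sort_n_set_position(key_and_rows):
    # One group from groupByKey: (key tuple, iterable of rows).
    key, rows = key_and_rows
    # Reverse-sort the group's rows on their last two values.
    ordered = sorted(rows, key=lambda r: (r[-2], r[-1]), reverse=True)
    # Append a 1-based position number to each row.
    return [tuple(r) + (pos,) for pos, r in enumerate(ordered, start=1)]

# B.map(sort_n_set_position).flatMap(lambda rows: rows) then yields the rows
# with the position column appended, as in the expected output above.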

best fahad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-18 Thread fahad shah
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
 at scala.Option.foreach(Option.scala:236)
 at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
 at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

On Sun, Oct 18, 2015 at 11:17 PM, Jeff Zhang  wrote:
> Stacktrace would be helpful if you can provide that.
>
>
>
> On Mon, Oct 19, 2015 at 1:42 PM, fahad shah  wrote:
>>
>> Hi
>>
>> I am trying to build pair RDDs, group by the key, and assign an ID based on the key.
>> I am using PySpark with Spark 1.3, and for some reason I am getting this
>> error that I am unable to figure out - any help much appreciated.
>>
>> Things I tried (but to no effect):
>>
>> 1. make sure I am not doing any conversions on the strings
>> 2. make sure that the fields used in the key are all present and not
>> empty strings (or else I toss the row out)
>>
>> My code is along the following lines (split uses StringIO to parse the
>> CSV, isHeader filters out the header row, and parse_train puts the 54
>> fields into a named tuple after whitespace/quote removal):
>>
>> # Error for string argument is thrown on BB.take(1), where the groupByKey
>> # is evaluated.
>>
>> A = sc.textFile("train.csv") \
>>      .filter(lambda x: not isHeader(x)) \
>>      .map(split) \
>>      .map(parse_train) \
>>      .filter(lambda x: x is not None)
>>
>> A.count()
>>
>> B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
>>                       k.srch_booking_window, k.srch_adults_count,
>>                       k.srch_children_count, k.srch_room_count),
>>                      k[0:54]))
>> BB = B.groupByKey()
>> BB.take(1)
>>
>>
>> best fahad
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-18 Thread fahad shah
Hi

I am trying to build pair RDDs, group by the key, and assign an ID based on the key.
I am using PySpark with Spark 1.3, and for some reason I am getting this
error that I am unable to figure out - any help much appreciated.

Things I tried (but to no effect):

1. make sure I am not doing any conversions on the strings
2. make sure that the fields used in the key are all present and not
empty strings (or else I toss the row out)

My code is along the following lines (split uses StringIO to parse the
CSV, isHeader filters out the header row, and parse_train puts the 54
fields into a named tuple after whitespace/quote removal):

# Error for string argument is thrown on BB.take(1), where the groupByKey
# is evaluated.

A = sc.textFile("train.csv") \
     .filter(lambda x: not isHeader(x)) \
     .map(split) \
     .map(parse_train) \
     .filter(lambda x: x is not None)

A.count()

B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
                      k.srch_booking_window, k.srch_adults_count,
                      k.srch_children_count, k.srch_room_count),
                     k[0:54]))
BB = B.groupByKey()
BB.take(1)
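
The helpers are not shown above; a simplified sketch of the kind of thing they
do (the field names, the six-field record, and the header check here are just
placeholders - the real named tuple has 54 fields):

import csv
from io import StringIO
from collections import namedtuple

# Placeholder record type - the real one has 54 fields, including the srch_*
# columns used in the grouping key above.
Record = namedtuple("Record", ["srch_destination_id", "srch_length_of_stay",
                               "srch_booking_window", "srch_adults_count",
                               "srch_children_count", "srch_room_count"])

def isHeader(line):
    # Placeholder check: assume the header row starts with a known column name.
    return line.startswith("srch_destination_id")

def split(line):
    # Parse one CSV line via StringIO so quoted fields are handled.
    return next(csv.reader(StringIO(line)))

def parse_train(fields):
    # Strip whitespace/quotes, drop rows with empty key fields, build the tuple.
    cleaned = [f.strip().strip('"') for f in fields]
    if any(c == "" for c in cleaned[:6]):
        return None
    return Record(*cleaned[:6])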


best fahad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org