Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
I think this could be the reason :

DataFrame sorts the column of each record lexicographically if we do a *select
**. So, if we wish to maintain a specific column ordering while processing
we should use do *select col1, col2...* instead of select *.

However, this is just what I feel. Let's wait for comments from the gurus.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Thu, Mar 3, 2016 at 5:35 AM, Mohammad Tariq  wrote:

> Cool. Here is it how it goes...
>
> I am reading Avro objects from a Kafka topic as a DStream, converting it
> into a DataFrame so that I can filter out records based on some conditions
> and finally do some aggregations on these filtered records. During the
> process I also need to tag each record based on the value of a particular
> column, and for this I am iterating over Array[Row] returned by
> DataFrame.collect().
>
> I am good as far as these things are concerned. The only thing which I am
> not getting is the reason behind changed column ordering within each Row.
> Say my actual record is [Tariq, IN, APAC]. When I
> do println(row.mkString("~")) it shows [IN~APAC~Tariq].
>
> I hope I was able to explain my use case to you!
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> 
>
>
> On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla 
> wrote:
>
>> Hi Tariq,
>>
>> Can you tell in brief what kind of operation you have to do? I can try
>> helping you out with that.
>> In general, if you are trying to use any group operations you can use
>> window operations.
>>
>> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq 
>> wrote:
>>
>>> Hi Sainath,
>>>
>>> Thank you for the prompt response!
>>>
>>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>>> get this. What kind of operation I can perform using SQLContext? It just
>>> helps us during things like DF creation, schema application etc, IMHO.
>>>
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> 
>>>
>>>
>>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla 
>>> wrote:
>>>
 Instead of collecting the data frame, you can try using a sqlContext on
 the data frame. But it depends on what kind of operations are you trying to
 perform.

 On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq 
 wrote:

> Hi list,
>
> *Scenario :*
> I am creating a DStream by reading an Avro object from a Kafka topic
> and then converting it into a DataFrame to perform some operations on the
> data. I call DataFrame.collect() and perform the intended operation on 
> each
> Row of Array[Row] returned by DataFrame.collect().
>
> *Problem : *
> Calling DataFrame.collect() changes the schema of the underlying
> record, thus making it impossible to get the columns by index(as the order
> gets changed).
>
> *Query :*
> Is it the way DataFrame.collect() behaves or am I doing something
> wrong here? In former case is there any way I can maintain the schema 
> while
> getting each Row?
>
> Any pointers/suggestions would be really helpful. Many thanks!
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> 
>
>


>>>
>>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Cool. Here is it how it goes...

I am reading Avro objects from a Kafka topic as a DStream, converting it
into a DataFrame so that I can filter out records based on some conditions
and finally do some aggregations on these filtered records. During the
process I also need to tag each record based on the value of a particular
column, and for this I am iterating over Array[Row] returned by
DataFrame.collect().

I am good as far as these things are concerned. The only thing which I am
not getting is the reason behind changed column ordering within each Row.
Say my actual record is [Tariq, IN, APAC]. When I
do println(row.mkString("~")) it shows [IN~APAC~Tariq].

I hope I was able to explain my use case to you!



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla 
wrote:

> Hi Tariq,
>
> Can you tell in brief what kind of operation you have to do? I can try
> helping you out with that.
> In general, if you are trying to use any group operations you can use
> window operations.
>
> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq  wrote:
>
>> Hi Sainath,
>>
>> Thank you for the prompt response!
>>
>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>> get this. What kind of operation I can perform using SQLContext? It just
>> helps us during things like DF creation, schema application etc, IMHO.
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> 
>>
>>
>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla 
>> wrote:
>>
>>> Instead of collecting the data frame, you can try using a sqlContext on
>>> the data frame. But it depends on what kind of operations are you trying to
>>> perform.
>>>
>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq 
>>> wrote:
>>>
 Hi list,

 *Scenario :*
 I am creating a DStream by reading an Avro object from a Kafka topic
 and then converting it into a DataFrame to perform some operations on the
 data. I call DataFrame.collect() and perform the intended operation on each
 Row of Array[Row] returned by DataFrame.collect().

 *Problem : *
 Calling DataFrame.collect() changes the schema of the underlying
 record, thus making it impossible to get the columns by index(as the order
 gets changed).

 *Query :*
 Is it the way DataFrame.collect() behaves or am I doing something wrong
 here? In former case is there any way I can maintain the schema while
 getting each Row?

 Any pointers/suggestions would be really helpful. Many thanks!


 [image: http://]

 Tariq, Mohammad
 about.me/mti
 [image: http://]
 


>>>
>>>
>>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Hi Tariq,

Can you tell in brief what kind of operation you have to do? I can try
helping you out with that.
In general, if you are trying to use any group operations you can use
window operations.

On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq  wrote:

> Hi Sainath,
>
> Thank you for the prompt response!
>
> Could you please elaborate your answer a bit? I'm sorry I didn't quite get
> this. What kind of operation I can perform using SQLContext? It just helps
> us during things like DF creation, schema application etc, IMHO.
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> 
>
>
> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla 
> wrote:
>
>> Instead of collecting the data frame, you can try using a sqlContext on
>> the data frame. But it depends on what kind of operations are you trying to
>> perform.
>>
>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq 
>> wrote:
>>
>>> Hi list,
>>>
>>> *Scenario :*
>>> I am creating a DStream by reading an Avro object from a Kafka topic and
>>> then converting it into a DataFrame to perform some operations on the data.
>>> I call DataFrame.collect() and perform the intended operation on each Row
>>> of Array[Row] returned by DataFrame.collect().
>>>
>>> *Problem : *
>>> Calling DataFrame.collect() changes the schema of the underlying record,
>>> thus making it impossible to get the columns by index(as the order gets
>>> changed).
>>>
>>> *Query :*
>>> Is it the way DataFrame.collect() behaves or am I doing something wrong
>>> here? In former case is there any way I can maintain the schema while
>>> getting each Row?
>>>
>>> Any pointers/suggestions would be really helpful. Many thanks!
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> 
>>>
>>>
>>
>>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi Sainath,

Thank you for the prompt response!

Could you please elaborate your answer a bit? I'm sorry I didn't quite get
this. What kind of operation I can perform using SQLContext? It just helps
us during things like DF creation, schema application etc, IMHO.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]



On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla 
wrote:

> Instead of collecting the data frame, you can try using a sqlContext on
> the data frame. But it depends on what kind of operations are you trying to
> perform.
>
> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq  wrote:
>
>> Hi list,
>>
>> *Scenario :*
>> I am creating a DStream by reading an Avro object from a Kafka topic and
>> then converting it into a DataFrame to perform some operations on the data.
>> I call DataFrame.collect() and perform the intended operation on each Row
>> of Array[Row] returned by DataFrame.collect().
>>
>> *Problem : *
>> Calling DataFrame.collect() changes the schema of the underlying record,
>> thus making it impossible to get the columns by index(as the order gets
>> changed).
>>
>> *Query :*
>> Is it the way DataFrame.collect() behaves or am I doing something wrong
>> here? In former case is there any way I can maintain the schema while
>> getting each Row?
>>
>> Any pointers/suggestions would be really helpful. Many thanks!
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> 
>>
>>
>
>


Re: Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Sainath Palla
Instead of collecting the data frame, you can try using a sqlContext on the
data frame. But it depends on what kind of operations are you trying to
perform.

On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq  wrote:

> Hi list,
>
> *Scenario :*
> I am creating a DStream by reading an Avro object from a Kafka topic and
> then converting it into a DataFrame to perform some operations on the data.
> I call DataFrame.collect() and perform the intended operation on each Row
> of Array[Row] returned by DataFrame.collect().
>
> *Problem : *
> Calling DataFrame.collect() changes the schema of the underlying record,
> thus making it impossible to get the columns by index(as the order gets
> changed).
>
> *Query :*
> Is it the way DataFrame.collect() behaves or am I doing something wrong
> here? In former case is there any way I can maintain the schema while
> getting each Row?
>
> Any pointers/suggestions would be really helpful. Many thanks!
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> 
>
>


Does DataFrame.collect() maintain the underlying schema?

2016-03-02 Thread Mohammad Tariq
Hi list,

*Scenario :*
I am creating a DStream by reading an Avro object from a Kafka topic and
then converting it into a DataFrame to perform some operations on the data.
I call DataFrame.collect() and perform the intended operation on each Row
of Array[Row] returned by DataFrame.collect().

*Problem : *
Calling DataFrame.collect() changes the schema of the underlying record,
thus making it impossible to get the columns by index(as the order gets
changed).

*Query :*
Is it the way DataFrame.collect() behaves or am I doing something wrong
here? In former case is there any way I can maintain the schema while
getting each Row?

Any pointers/suggestions would be really helpful. Many thanks!


[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]