Re: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-19 Thread BB
I am quoting the reply I got on this, which for some reason did not get
posted here. The suggestion in the reply below worked perfectly for me. The
bug mentioned in the reply is unrelated (or old).
 Hope this is helpful to someone.
Cheers,
BB


 Hi, BB
Ideally you can write the query like: select key, value.percent from
 mytable_data lateral view explode(audiences) f as key, value limit 3;
But there is a bug in HiveContext:
 https://issues.apache.org/jira/browse/SPARK-5237
I am working on it now, and hopefully will have a patch soon.
 Cheng Hao
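For the key-based filtering that prompted the thread, the same explode pattern presumably extends to a WHERE clause on the exploded key column. An untested HiveQL sketch (table and column names are the ones from the thread; the key value 'tg_loh' is taken from the sample output below):

```sql
-- Explode the map into one (key, value) row per entry, then filter on the key
SELECT key, value.percent
FROM mytable_data LATERAL VIEW explode(audiences) f AS key, value
WHERE key = 'tg_loh'
LIMIT 10;
```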





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-hiveContext-to-select-a-nested-Map-data-type-from-an-AVROmodel-parquet-file-tp21168p21231.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-17 Thread Cheng, Hao
Wow, glad to know that it works well. And sorry, the JIRA is a different issue,
not the same case as here.

From: Bagmeet Behera [mailto:bagme...@gmail.com]
Sent: Saturday, January 17, 2015 12:47 AM
To: Cheng, Hao
Subject: Re: using hiveContext to select a nested Map-data-type from an 
AVROmodel+parquet file

Hi Cheng, Hao
   An update: I installed the latest binaries of Spark 1.2.0 (prebuilt for
Hadoop 2.4 and later) and tried your suggestion. And it *works* perfectly!
   Therefore I would encourage you to post your reply on the archive for the
benefit of all.

Thanks and best wishes,
BB (Bagmeet)

On Fri, Jan 16, 2015 at 11:20 AM, Bagmeet Behera bagme...@gmail.com wrote:
Hi Cheng, Hao
 The awesome thing is: the way you suggest works perfectly on Spark 1.1.0. I
am testing this on an old test installation with Spark 1.1.0 (installed from
http://spark.apache.org/) with Scala 2.10.4.

 Just FYI: this was because I could not create a HiveContext on the newer
installation of Spark 1.2.0 (Scala 2.10.4) from the Cloudera CDH 5.3.0
release, which gave a strange error suggesting some incompatibility between
the Hive and Spark libraries. I can create a post for this (if I find an
appropriate user group, perhaps on the Cloudera side), but could this also be
a result of the bug you mention?

 BTW, your reply is not in the archives. I guess this is also due to the bug
in the current version that you mentioned?

 Many thanks for the reply.
Best,
BB


On Fri, Jan 16, 2015 at 3:24 AM, Cheng, Hao hao.ch...@intel.com wrote:
Hi, BB
   Ideally you can write the query like: select key, value.percent from
mytable_data lateral view explode(audiences) f as key, value limit 3;
   But there is a bug in HiveContext:
https://issues.apache.org/jira/browse/SPARK-5237
   I am working on it now, and hopefully will have a patch soon.

Cheng Hao
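The explode semantics can be modeled in plain Scala without Spark. This is a minimal sketch with hypothetical names mirroring the schema in the question below; it also shows why indexing the map with the literal string 'key' (as in the failing query) comes back null:

```scala
// Hypothetical stand-in for one row's `audiences` column:
// map<string, struct<percent: float, cluster: int>> from the thread's schema.
case class AudienceValue(percent: Float, cluster: Int)

val audiences: Map[String, AudienceValue] = Map(
  "tg_loh" -> AudienceValue(0.0f, 1),
  "tg_co"  -> AudienceValue(0.0f, 1)
)

// LATERAL VIEW explode(audiences) f AS key, value produces one
// (key, value) row per map entry -- equivalent to flattening the map:
val exploded: Seq[(String, AudienceValue)] = audiences.toSeq

// select key, value.percent ... limit 3
val percents: Seq[(String, Float)] =
  exploded.map { case (k, v) => (k, v.percent) }.take(3)

// audiences['key'] looks up the literal string "key", which is not an
// entry in the map -- hence the [null,null] rows in the failing query.
val literalLookup: Option[AudienceValue] = audiences.get("key")

// Filtering on a specific key, the intent behind the WHERE clause:
val filtered = exploded.filter { case (k, _) => k == "tg_loh" }
```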

-Original Message-
From: BB [mailto:bagme...@gmail.com]
Sent: Friday, January 16, 2015 12:52 AM
To: user@spark.apache.org
Subject: using hiveContext to select a nested Map-data-type from an 
AVROmodel+parquet file

Hi all,
  Any help on the following is very much appreciated.
=
Problem:
  On a schemaRDD read from a parquet file (the data in the file uses an Avro
model) using the HiveContext:
 I can't figure out how to 'select', or use a 'where' clause to filter rows,
on a field that has a Map Avro data type. I want to filter using a given
('key' : 'value') pair. How could I do this?

Details:
* the printSchema of the loaded schemaRDD is like so:

-- output snippet -
|-- created: long (nullable = false)
|-- audiences: map (nullable = true)
||-- key: string
||-- value: struct (valueContainsNull = false)
|||-- percent: float (nullable = false)
|||-- cluster: integer (nullable = false)
-

* I don't get a result when I try to select on a specific value of the
'audiences' map like so:

  SELECT created, audiences FROM mytable_data LATERAL VIEW
explode(audiences) adtab AS adcol WHERE audiences['key']=='tg_loh' LIMIT 10

 sequence of commands on the spark-shell (a different query and output) is:

-- code snippet -
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val parquetFile2 = hiveContext.parquetFile("/home/myuser/myparquetfile")
scala> parquetFile2.registerTempTable("mytable_data")
scala> hiveContext.cacheTable("mytable_data")

scala> hiveContext.sql("SELECT audiences['key'], audiences['value'] FROM
mytable_data LATERAL VIEW explode(audiences) adu AS audien LIMIT
3").collect().foreach(println)

-- output -
[null,null]
[null,null]
[null,null]


This gives a list of nulls. I can see that there is data when I just do the
following (output is truncated):

-- code snippet -
scala> hiveContext.sql("SELECT audiences FROM mytable_data LATERAL VIEW
explode(audiences) tablealias AS colalias LIMIT
1").collect().foreach(println)

-- output -
[Map(tg_loh -> [0.0,1,Map()], tg_co -> [0.0,1,Map(tg_co_petrol -> 0.0)],
tg_wall -> [0.0,1,Map(tg_wall_poi -> 0.0)],  ...


Q1) What am I doing wrong?
Q2) How can I use 'where' in the query to filter on specific values?

What works:
   Queries that filter and select on fields with simple Avro data types, such
as long or string, work fine.

===

 I hope the explanation makes sense. Thanks.
Best,
BB



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/using-hiveContext-to-select-a-nested-Map-data-type-from-an-AVROmodel-parquet-file-tp21168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


RE: using hiveContext to select a nested Map-data-type from an AVROmodel+parquet file

2015-01-15 Thread Cheng, Hao
Hi, BB
   Ideally you can write the query like: select key, value.percent from
mytable_data lateral view explode(audiences) f as key, value limit 3;
   But there is a bug in HiveContext:
https://issues.apache.org/jira/browse/SPARK-5237
   I am working on it now, and hopefully will have a patch soon.

Cheng Hao

-Original Message-
From: BB [mailto:bagme...@gmail.com] 
Sent: Friday, January 16, 2015 12:52 AM
To: user@spark.apache.org
Subject: using hiveContext to select a nested Map-data-type from an 
AVROmodel+parquet file

Hi all,
  Any help on the following is very much appreciated.
=
Problem:
  On a schemaRDD read from a parquet file (the data in the file uses an Avro
model) using the HiveContext:
 I can't figure out how to 'select', or use a 'where' clause to filter rows,
on a field that has a Map Avro data type. I want to filter using a given
('key' : 'value') pair. How could I do this?

Details:
* the printSchema of the loaded schemaRDD is like so:

-- output snippet -
|-- created: long (nullable = false)
|-- audiences: map (nullable = true)
||-- key: string
||-- value: struct (valueContainsNull = false)
|||-- percent: float (nullable = false)
|||-- cluster: integer (nullable = false)
- 

* I don't get a result when I try to select on a specific value of the
'audiences' map like so:

  SELECT created, audiences FROM mytable_data LATERAL VIEW
explode(audiences) adtab AS adcol WHERE audiences['key']=='tg_loh' LIMIT 10

 sequence of commands on the spark-shell (a different query and output) is:

-- code snippet -
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val parquetFile2 = hiveContext.parquetFile("/home/myuser/myparquetfile")
scala> parquetFile2.registerTempTable("mytable_data")
scala> hiveContext.cacheTable("mytable_data")

scala> hiveContext.sql("SELECT audiences['key'], audiences['value'] FROM
mytable_data LATERAL VIEW explode(audiences) adu AS audien LIMIT
3").collect().foreach(println)

-- output -
[null,null]
[null,null]
[null,null]


This gives a list of nulls. I can see that there is data when I just do the
following (output is truncated):

-- code snippet -
scala> hiveContext.sql("SELECT audiences FROM mytable_data LATERAL VIEW
explode(audiences) tablealias AS colalias LIMIT
1").collect().foreach(println)

-- output -
[Map(tg_loh -> [0.0,1,Map()], tg_co -> [0.0,1,Map(tg_co_petrol -> 0.0)],
tg_wall -> [0.0,1,Map(tg_wall_poi -> 0.0)],  ...


Q1) What am I doing wrong?
Q2) How can I use 'where' in the query to filter on specific values?

What works:
   Queries that filter and select on fields with simple Avro data types, such
as long or string, work fine.

===

 I hope the explanation makes sense. Thanks.
Best,
BB





