UseCase_Design_Help

2016-10-05 Thread Ajay Chander
Ayan, thanks for the help. In my scenario I currently have the business rule, i.e. the Animal Types, in a file (later it will be in a Hive table). I want to go after only those elements from that list. Once I identify the distinct counts, I have to implement two different functionalities: one if count(distinct(element)) <= 10 and …

Re: UseCase_Design_Help

2016-10-05 Thread ayan guha
Hi, you can "generate" the SQL through a program. Python example:

>>> schema = ['id', 'Mammals', 'Birds', 'Fish', 'Reptiles', 'Amphibians']
>>> count_stmt = ["count(distinct <col>) as <col>".replace("<col>", x) for x in schema]
>>> count_stmt
['count(distinct id) as id', 'count(distinct Mammals) as Mammals', 'count…
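(The same idea in Scala, as a minimal sketch; "df2" as a registered temp table name is an assumption, it is not given in the thread:)

    // Build one SQL statement containing all the distinct counts, so that
    // a single query (and a single Spark job) covers every column.
    val schema = Seq("id", "Mammals", "Birds", "Fish", "Reptiles", "Amphibians")
    val countStmt = schema.map(c => s"count(distinct $c) as $c").mkString(", ")
    // Assumes DF2 was registered via DF2.registerTempTable("df2")
    val counts = sqlContext.sql(s"select $countStmt from df2")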

Re: UseCase_Design_Help

2016-10-05 Thread Ajay Chander
+ user@spark.apache.org Hi Daniel, I will try this one out and let you know. Thank you. On Wed, Oct 5, 2016 at 9:50 AM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: > I think it's fine to read animal types locally because there are only 70 > of them. It's just that you want to execut…

Re: UseCase_Design_Help

2016-10-05 Thread Ajay Chander
Hi Ayan, my schema for DF2 is fixed, but it has around 420 columns (70 animal-type columns and 350 other columns). Thanks, Ajay. On Wed, Oct 5, 2016 at 10:37 AM, ayan guha wrote: > Is your schema for df2 fixed? i.e. do you have 70 category columns? > > On Thu, Oct 6, 2016 at 12:50 AM, Daniel Si…

Re: UseCase_Design_Help

2016-10-05 Thread ayan guha
Is your schema for df2 fixed? i.e. do you have 70 category columns? On Thu, Oct 6, 2016 at 12:50 AM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote: > I think it's fine to read animal types locally because there are only 70 > of them. It's just that you want to execute the Spark actions…

Re: UseCase_Design_Help

2016-10-05 Thread Daniel Siegmann
I think it's fine to read the animal types locally because there are only 70 of them. It's just that you want to execute the Spark actions in parallel, and the easiest way to do that is to have only a single action. Instead of grabbing the result right away, I would just add a column for the animal type a…
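(Daniel's message is cut off above, so the exact mechanics of his "add a column" approach aren't recoverable here. One way to collapse the 70 counts into a single action, as a minimal sketch assuming df2 is the second DataFrame and animalTypes holds the 70 column names read locally:)

    import org.apache.spark.sql.functions.{col, countDistinct}

    // One aggregation over all animal-type columns => a single Spark job,
    // instead of one job per column as in the original loop.
    val aggCols = animalTypes.map(c => countDistinct(col(c)).alias(c))
    val row = df2.agg(aggCols.head, aggCols.tail: _*).head()

    // Split the columns by the <= 10 business rule from earlier in the thread.
    val (smallCols, largeCols) =
      animalTypes.partition(c => row.getLong(row.fieldIndex(c)) <= 10)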

Re: UseCase_Design_Help

2016-10-04 Thread Daniel
First of all, if you want to read a text file through Spark you should use sc.textFile; "Source.fromFile" reads it with the Scala standard API, so it will be read sequentially on the driver. Furthermore, you are going to need to create a schema if you want to use DataFrames. On 5/10/2016…
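(A minimal sketch of that suggestion; the file path is taken from Ajay's code below, and the column name "animal_type" is my own placeholder:)

    // Read the file through Spark rather than scala.io.Source, then give it
    // a schema so it can be used as a DataFrame.
    import sqlContext.implicits._
    val animalTypesDF = sc.textFile("/home/ajay/dataset/animal_types.txt")
      .toDF("animal_type")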

Re: UseCase_Design_Help

2016-10-04 Thread Ajay Chander
Right now, I am doing it like below:

    import scala.io.Source

    val animalsFile = "/home/ajay/dataset/animal_types.txt"
    val animalTypes = Source.fromFile(animalsFile).getLines.toArray
    for (anmtyp <- animalTypes) {
      val distinctAnmTypCount = sqlContext.sql("select count(distinct(" + anmtyp + ")) f…
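(The query is cut off above. A runnable sketch of what the loop presumably looks like, assuming DF2 was registered as a temp table named "df2"; the table name and the two branch bodies are placeholders, not from the post:)

    import scala.io.Source

    val animalTypes =
      Source.fromFile("/home/ajay/dataset/animal_types.txt").getLines.toArray
    for (anmtyp <- animalTypes) {
      // One Spark job per column; this is the sequential pattern the replies
      // in this thread improve on.
      val distinctAnmTypCount =
        sqlContext.sql(s"select count(distinct($anmtyp)) from df2").head().getLong(0)
      if (distinctAnmTypCount <= 10) {
        // functionality for low-cardinality columns
      } else {
        // functionality for high-cardinality columns
      }
    }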

UseCase_Design_Help

2016-10-04 Thread Ajay Chander
Hi Everyone, I have a use-case where I have two Dataframes like below,

1) First Dataframe (DF1) contains,

*ANIMALS*
Mammals
Birds
Fish
Reptiles
Amphibians

2) Second Dataframe (DF2) contains,

*ID, Mammals, Birds, Fish, Reptiles, Amphibians*
1, Dogs, Eagle, Goldfish, …