RE: How to read a Multi Line json object via Spark

2016-11-14 Thread Kappaganthu, Sivaram (ES)
Hello,

Please find attached the old mail on this subject.

Thanks,
Sivaram
From: Sree Eedupuganti [mailto:s...@inndata.in]
Sent: Tuesday, November 15, 2016 12:51 PM
To: user
Subject: How to read a Multi Line json object via Spark

I tried this from the Spark shell and I am getting the following error:

Here is the test.json file:


{
  "colorsArray": [{
    "red": "#f00",
    "green": "#0f0",
    "blue": "#00f",
    "cyan": "#0ff",
    "magenta": "#f0f",
    "yellow": "#ff0",
    "black": "#000"
  }]
}


scala> val jtex = sqlContext.read.format("json").option("samplingRatio", "1.0").load("/user/spark/test.json")
jtex: org.apache.spark.sql.DataFrame = [_corrupt_record: string]

Any suggestions, please? Thanks.
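[The reader of that era parses JSON line by line, so a pretty-printed document like the one above comes back as a single _corrupt_record column. A minimal sketch of the direct fix, assuming Spark 2.2+ where the JSON reader gained a multiLine option (earlier versions need the wholeTextFiles approach shown further down in this thread):

// Sketch, assuming Spark 2.2+ and the spark-shell's predefined SparkSession "spark"
val jtex = spark.read.option("multiLine", "true").json("/user/spark/test.json")
jtex.printSchema()   // should now show colorsArray instead of _corrupt_record
]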
--
Best Regards,
Sreeharsha Eedupuganti
Data Engineer
innData Analytics Private Limited

--
This message and any attachments are intended only for the use of the addressee 
and may contain information that is privileged and confidential. If the reader 
of the message is not the intended recipient or an authorized representative of 
the intended recipient, you are hereby notified that any dissemination of this 
communication is strictly prohibited. If you have received this communication 
in error, notify the sender immediately by return email and delete the message 
and any attachments from your system.
--- Begin Message ---
Hi,

Thank you so much for the reply; the link is very good. I am new to both
Spark and Python, and I am having a little difficulty understanding the code
below. Could you please provide Scala code for the following Python?



import json
import re

multiline_rdd = sc.wholeTextFiles(inputFile)
type(multiline_rdd)

json_rdd = multiline_rdd.map(lambda x: x[1]) \
                        .map(lambda x: re.sub(r"\s+", "", x, flags=re.UNICODE))
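
[A rough Scala equivalent of the Python above, as an untested sketch; in the spark-shell, sc and sqlContext are predefined, and inputFile is a placeholder path:

val inputFile = "/path/to/input.json"            // hypothetical path
val multilineRdd = sc.wholeTextFiles(inputFile)  // RDD of (path, wholeFileContent) pairs
val jsonRdd = multilineRdd
  .map(_._2)                                     // keep only the file content
  .map(_.replaceAll("\\s+", ""))                 // strip whitespace, like re.sub(r"\s+", "", x);
                                                 // note this also removes spaces inside string values
val df = sqlContext.read.json(jsonRdd)           // each record is parsed as one JSON document
]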



Thanks,

Sivaram

From: Hyukjin Kwon [mailto:gurwls...@gmail.com]
Sent: Wednesday, October 12, 2016 11:45 AM
To: Kappaganthu, Sivaram (ES)
Cc: Luciano Resende; Jean Georges Perrin; user @spark
Subject: Re: JSON Arrays and Spark



No, I meant each JSON object should be on a single line, but an array type is
also supported as a root wrapper of JSON objects.



If you need to parse multiple lines, I have a reference here.



http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/



2016-10-12 15:04 GMT+09:00 Kappaganthu, Sivaram (ES) <sivaram.kappagan...@adp.com>:

Hi,



Does this mean that Spark is not a good fit for handling any JSON with the
kind of schema below? I have a requirement to parse JSON that spans multiple
lines. What is the best way to parse JSON of this kind? Please suggest.



root
 |-- maindate: struct (nullable = true)
 |    |-- mainidnId: string (nullable = true)
 |-- Entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Profile: struct (nullable = true)
 |    |    |    |-- Kind: string (nullable = true)
 |    |    |-- Identifier: string (nullable = true)
 |    |    |-- Group: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Period: struct (nullable = true)
 |    |    |    |    |    |-- pid: string (nullable = true)
 |    |    |    |    |    |-- pDate: string (nullable = true)
 |    |    |    |    |    |-- quarter: long (nullable = true)
 |    |    |    |    |    |-- labour: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |-- person: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- address: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- line1: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- line2: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- postalCode: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- state: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- familyName: string (nullable = true)
 |    |    |    |    |    |    |    |-- tax: array (nullable = true)

Help needed in parsing JSON with nested structures

2016-10-31 Thread Kappaganthu, Sivaram (ES)
Hello All,



I am processing a complex nested JSON, and below is its schema.

root
 |-- businessEntity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- payGroup: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- reportingPeriod: struct (nullable = true)
 |    |    |    |    |    |-- worker: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |    |    |    |-- person: struct (nullable = true)
 |    |    |    |    |    |    |    |-- tax: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- code: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qtdAmount: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- ytdAmount: double (nullable = true)

My requirement is to create a hash map with the code concatenated with
"qtdAmount" as the key and the qtdAmount as the value, i.e.
Map.put(code + "qtdAmount", qtdAmount). How can I do this with Spark?
I tried the shell commands below:

import org.apache.spark.sql._
import org.apache.spark.sql.functions.explode

val sqlcontext = new SQLContext(sc)
val cdm = sqlcontext.read.json("/user/edureka/CDM/cdm.json")
val spark = SparkSession.builder().appName("SQL").config("spark.some.config.option", "some-value").getOrCreate()
cdm.createOrReplaceTempView("CDM")
val sqlDF = spark.sql("SELECT businessEntity[0].payGroup[0] from CDM").show()
val address = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].person.address from CDM as address")
val worker = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker from CDM")
val tax = spark.sql("SELECT businessEntity[0].payGroup[0].reportingPeriod.worker[0].tax from CDM")
val codes = tax.select(explode(tax("tax.code")))
val codes = tax.withColumn("code", explode(tax("tax.code"))).withColumn("qtdAmount", explode(tax("tax.qtdAmount"))).withColumn("ytdAmount", explode(tax("tax.ytdAmount")))


I am trying to get all the codes and qtdAmounts into a map, but I am not
getting it to work. Using multiple explode statements on a single DataFrame
produces a Cartesian product of the elements.
Could someone please help with how to parse JSON this complex in Spark?
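
[One way to avoid the Cartesian product, sketched here on the assumption that
the tax DataFrame above has a single array-of-structs column named tax:
explode the array once, then read the struct fields, so code and qtdAmount
stay aligned within each row.

import org.apache.spark.sql.functions.explode

val taxElems = tax.select(explode(tax("tax")).as("t"))   // one row per tax element
val pairs = taxElems.select("t.code", "t.qtdAmount")     // fields stay aligned per row

// Build the requested Map(code + "qtdAmount" -> qtdAmount); collect() pulls
// all rows to the driver, so this suits small results only.
val amountMap: Map[String, Double] =
  pairs.collect().map(r => (r.getString(0) + "qtdAmount", r.getDouble(1))).toMap
]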


Thanks,
Sivaram



RE: how to extract arraytype data to file

2016-10-18 Thread Kappaganthu, Sivaram (ES)
There is a function called explode for this.
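
[A minimal sketch of that approach, assuming the bizs schema quoted below
(the file path is hypothetical):

import org.apache.spark.sql.functions.explode

val df = sqlContext.read.json("/path/to/bizs.json")      // hypothetical path
val rows = df.select(explode(df("bizs")).as("biz"))      // one row per array element
             .select("biz.bid", "biz.code")              // struct fields become plain columns
rows.show()
]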

From: lk_spark [mailto:lk_sp...@163.com]
Sent: Wednesday, October 19, 2016 9:06 AM
To: user.spark
Subject: how to extract arraytype data to file

Hi all,

I want to read a JSON file and search it with SQL.
The data structure should be:
bid: string (nullable = true)
code: string (nullable = true)
and the JSON file data should be like:
 {"bid":"MzI4MTI5MzcyNw==","code":"罗甸网警"}
 {"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}
but in fact my JSON file data is:
{"bizs":[{"bid":"MzI4MTI5MzcyNw==","code":"罗甸网警"},{"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}]}
{"bizs":[{"bid":"MzI4MTI5Mzcy00==","code":"罗甸网警"},{"bid":"MzI3MzQ5Nzc201==","code":"西早君"}]}
When I load it with Spark, the schema looks like this:
root
 |-- bizs: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- bid: string (nullable = true)
 |    |    |-- code: string (nullable = true)

I can select columns with df.select("bizs.bid", "bizs.code"),
but the column values are of array type:

+--------------------+--------------------+
|                 bid|                code|
+--------------------+--------------------+
|[4938200, 4938201...|[罗甸网警, 室内设计师杨焰红, ...|
|[4938300, 4938301...|[SDCS十全九美, 旅梦长大, ...|
|[4938400, 4938401...|[日重重工液压行走回转, 氧老家,...|
|[4938500, 4938501...|[PABXSLZ, 陈少燕, 笑蜜...|
|[4938600, 4938601...|[税海微云, 西域美农云家店, 福...|
+--------------------+--------------------+

What I want is to read the columns as normal rows. How can I do it?
2016-10-19

lk_spark



RE: JSON Arrays and Spark

2016-10-12 Thread Kappaganthu, Sivaram (ES)
Hi,

Does this mean that Spark is not a good fit for handling any JSON with the
kind of schema below? I have a requirement to parse JSON that spans multiple
lines. What is the best way to parse JSON of this kind? Please suggest.

root
 |-- maindate: struct (nullable = true)
 |    |-- mainidnId: string (nullable = true)
 |-- Entity: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Profile: struct (nullable = true)
 |    |    |    |-- Kind: string (nullable = true)
 |    |    |-- Identifier: string (nullable = true)
 |    |    |-- Group: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- Period: struct (nullable = true)
 |    |    |    |    |    |-- pid: string (nullable = true)
 |    |    |    |    |    |-- pDate: string (nullable = true)
 |    |    |    |    |    |-- quarter: long (nullable = true)
 |    |    |    |    |    |-- labour: array (nullable = true)
 |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |    |    |    |-- id: string (nullable = true)
 |    |    |    |    |    |    |    |-- person: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- address: array (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |    |-- city: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- line1: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- line2: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- postalCode: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- state: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- type: string (nullable = true)
 |    |    |    |    |    |    |    |    |-- familyName: string (nullable = true)
 |    |    |    |    |    |    |    |-- tax: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- code: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qwage: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qvalue: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qSubjectvalue: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- qfinalvalue: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- ywage: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- yalue: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- ySubjectvalue: double (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- yfinalvalue: double (nullable = true)
 |    |    |    |    |    |    |    |-- tProfile: array (nullable = true)
 |    |    |    |    |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |    |    |    |    |-- isExempt: boolean (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- jurisdiction: struct (nullable = true)
 |    |    |    |    |    |    |    |    |    |    |-- code: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- maritalStatus: string (nullable = true)
 |    |    |    |    |    |    |    |    |    |-- numberOfDeductions: long (nullable = true)
 |    |    |    |    |    |    |    |-- wDate: struct (nullable = true)
 |    |    |    |    |    |    |    |    |-- originalHireDate: string (nullable = true)
 |    |    |    |    |-- year: long (nullable = true)


From: Luciano Resende [mailto:luckbr1...@gmail.com]
Sent: Monday, October 10, 2016 11:39 PM
To: Jean Georges Perrin
Cc: user @spark
Subject: Re: JSON Arrays and Spark

Please take a look at
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
In particular, note the required format:

Note that the file that is offered as a json file is not a typical JSON file. 
Each line must contain a separate, self-contained valid JSON object. As a 
consequence, a regular multi-line JSON file will most often fail.
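
[As an illustration (a sketch, not from the original thread), the colors
document from the first message parses cleanly once the whole object sits on
a single line:

// Hypothetical file /tmp/colors.json containing exactly one line:
//   {"colorsArray":[{"red":"#f00","green":"#0f0","blue":"#00f"}]}
val ok = sqlContext.read.json("/tmp/colors.json")
ok.printSchema()   // root |-- colorsArray: array (nullable = true) ...
]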


On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin wrote:
Hi folks,

I am trying to parse JSON arrays and it’s getting a little crazy (for me at 
least)…

1)
If my JSON is:
{"vals":[100,500,600,700,800,200,900,300]}

I get:
+--------------------+
|                vals|
+--------------------+
|[100, 500, 600, 7...|
+--------------------+

root
 |-- vals: array (nullable = true)
 |    |-- element: long (containsNull = true)

and I am :)

2)
If my JSON is:
[100,500,600,700,800,200,900,300]

I get:
+--------------------+
|     _corrupt_record|
+--------------------+
|[100,500,600,700,...|
+--------------------+

root
 |-- _corrupt_record: string (nullable = true)

Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Kappaganthu, Sivaram (ES)
Hello,

I am a newbie to Spark and I have the below requirement.

Problem statement: A third-party application is continuously dumping files
onto a server, typically 100 files per hour, each under 50 MB in size. My
application has to process those files.

Here:
1) Is it possible for Spark Streaming to trigger a job when a file is placed,
instead of triggering jobs at a fixed batch interval? (A sketch follows
below.)
2) If it is not possible with Spark Streaming, can we control this with
Kafka/Flume?
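
[On question 1, a minimal sketch of the usual approach, assuming classic
Spark Streaming (the paths and interval are illustrative, not from this
thread): textFileStream watches a directory and picks up newly arrived files,
but it still runs on the batch interval, so discovery is per batch rather
than per file.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("FileWatcher")
val ssc = new StreamingContext(conf, Seconds(60))           // assumed batch interval
val lines = ssc.textFileStream("hdfs:///landing/incoming")  // hypothetical directory
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    println(s"received ${rdd.count()} new lines")           // process new files here
  }
}
ssc.start()
ssc.awaitTermination()
]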

Thanks,
Sivaram
