Re: Spark for offline log processing/querying

2016-05-22 Thread Sonal Goyal
Hi Mat,

I think you could also use Spark SQL to query the logs. Hope the following
link helps:

https://databricks.com/blog/2014/09/23/databricks-reference-applications.html
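
For example, a rough sketch in spark-shell of going from raw text logs on S3 to
Parquet and then querying with Spark SQL (the S3 paths, the log line layout and
the field names below are just illustrative assumptions):

import sqlContext.implicits._

// Hypothetical log line layout: "<timestamp> <level> <message>"
case class LogLine(ts: String, level: String, message: String)

val raw = sc.textFile("s3n://logs/region/grouping/instance/service/*.logs")
val logsDF = raw.map(_.split(" ", 3))
  .filter(_.length == 3)
  .map(a => LogLine(a(0), a(1), a(2)))
  .toDF()

// Write once as Parquet; Presto (or Spark SQL) can then query the Parquet files
logsDF.write.parquet("s3n://logs-parquet/service/")

// Query it back with Spark SQL
sqlContext.read.parquet("s3n://logs-parquet/service/").registerTempTable("logs")
sqlContext.sql("SELECT level, count(*) AS cnt FROM logs GROUP BY level").show()
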
On May 23, 2016 10:59 AM, "Mat Schaffer"  wrote:

> I'm curious about trying to use spark as a cheap/slow ELK
> (ElasticSearch,Logstash,Kibana) system. Thinking something like:
>
> - instances rotate local logs
> - copy rotated logs to s3
> (s3://logs/region/grouping/instance/service/*.logs)
> - spark to convert from raw text logs to parquet
> - maybe presto to query the parquet?
>
> I'm still new on Spark though, so thought I'd ask if anyone was familiar
> with this sort of thing and if there are maybe some articles or documents I
> should be looking at in order to learn how to build such a thing. Or if
> such a thing even made sense.
>
> Thanks in advance, and apologies if this has already been asked and I
> missed it!
>
> -Mat
>
> matschaffer.com
>


Re:How spark depends on Guava

2016-05-22 Thread Todd


Can someone please take a look at my question? I am using spark-shell in local
mode and yarn-client mode. Spark code uses the Guava library, so Spark should
have Guava in place at runtime.


Thanks.





At 2016-05-23 11:48:58, "Todd"  wrote:

Hi,
In the Spark code, the Guava Maven dependency scope is "provided". My question
is: how does Spark depend on Guava at runtime? I looked into
spark-assembly-1.6.1-hadoop2.6.1.jar and didn't find class entries like
com.google.common.base.Preconditions etc...


Spark for offline log processing/querying

2016-05-22 Thread Mat Schaffer
I'm curious about trying to use Spark as a cheap/slow ELK
(Elasticsearch, Logstash, Kibana) system. Thinking something like:

- instances rotate local logs
- copy rotated logs to S3
(s3://logs/region/grouping/instance/service/*.logs)
- Spark to convert the raw text logs to Parquet
- maybe Presto to query the Parquet?

I'm still new to Spark though, so I thought I'd ask if anyone is familiar
with this sort of thing and whether there are some articles or documents I
should be looking at in order to learn how to build such a thing. Or whether
such a thing even makes sense.

Thanks in advance, and apologies if this has already been asked and I
missed it!

-Mat

matschaffer.com


Re: Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi,
I finally got it working.
I was using the updateStateByKey() function to maintain the previous
value of the state, and I found that the event list was empty. Hence,
handling the empty event list with event.isEmpty() sorted out the
problem.
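
For anyone else hitting this, a minimal sketch of the two guards (the output
path and the state logic are just placeholders):

// dstream is the DStream coming from textFileStream (placeholder name)
dstream.foreachRDD { rdd =>
  // only write out non-empty batches
  if (!rdd.isEmpty()) {
    rdd.saveAsTextFile("hdfs:///output/batch-" + System.currentTimeMillis)
  }
}

// and in updateStateByKey, keep the previous state when a batch has no events
def updateFunc(events: Seq[String], state: Option[Long]): Option[Long] = {
  if (events.isEmpty) state else Some(state.getOrElse(0L) + events.size)
}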

On Sun, May 22, 2016 at 7:59 PM, Ted Yu  wrote:
> You mean when rdd.isEmpty() returned false, saveAsTextFile still produced
> empty file ?
>
> Can you show code snippet that demonstrates this ?
>
> Cheers
>
> On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas  wrote:
>>
>> Hi,
>> I am reading files using textFileStream, performing some action onto
>> it and then saving it to HDFS using saveAsTextFile.
>> But whenever there is no file to read, Spark will write and empty RDD(
>> [] ) to HDFS.
>> So, how to handle the empty RDD.
>>
>> I checked rdd.isEmpty() and rdd.count>0, but both of them does not works.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Timur Shenkao
Hi,
Thanks a lot for such an interesting comparison. But important questions
remain to be addressed:

1) How do we make 2 versions of Spark live together on the same cluster
(library clashes, paths, etc.)?
Most Spark users perform ETL and ML operations on Spark as well, so we
may have 3 Spark installations simultaneously.

2) How stable is such a construction on INSERT / UPDATE / CTAS operations?
Any problems with writing into specific tables / directories, ORC / Parquet
peculiarities, or memory / timeout parameter tuning?

3) How stable is such a construction in a multi-user / multi-tenant production
environment when several people run different queries simultaneously?

It's impossible to restart Spark masters and workers several times a day, or
to tune them constantly.


On Mon, May 23, 2016 at 2:42 AM, Mich Talebzadeh 
wrote:

> Hi,
>
>
>
> I have done a number of extensive tests using Spark-shell with Hive DB and
> ORC tables.
>
>
>
> Now one issue that we typically face is and I quote:
>
>
>
> Spark is fast as it uses Memory and DAG. Great but when we save data it is
> not fast enough
>
> OK but there is a solution now. If you use Spark with Hive and you are on
> a descent version of Hive >= 0.14, then you can also deploy Spark as
> execution engine for Hive. That will make your application run pretty fast
> as you no longer rely on the old Map-Reduce for Hive engine. In a nutshell
> what you are gaining speed in both querying and storage.
>
>
>
> I have made some comparisons on this set-up and I am sure some of you will
> find it useful.
>
>
>
> The version of Spark I use for Spark queries (Spark as query tool) is 1.6.
>
> The version of Hive I use in Hive 2
>
> The version of Spark I use as Hive execution engine is 1.3.1 It works and
> frankly Spark 1.3.1 as an execution engine is adequate (until we sort out
> the Hadoop libraries mismatch).
>
>
>
> An example I am using Hive on Spark engine to find the min and max of IDs
> for a table with 1 billion rows:
>
>
>
> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
> stddev(id) from oraclehadoop.dummy;
>
> Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
>
>
>
>
> Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151
>
>
>
> INFO  : Completed compiling
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
> Time taken: 1.911 seconds
>
> INFO  : Executing
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
> select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy
>
> INFO  : Query ID =
> hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006
>
> INFO  : Total jobs = 1
>
> INFO  : Launching Job 1 out of 1
>
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
>
>
>
> Query Hive on Spark job[0] stages:
>
> 0
>
> 1
>
> Status: Running (Hive on Spark job[0])
>
> Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>
> 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>
> INFO  :
>
> Query Hive on Spark job[0] stages:
>
> INFO  : 0
>
> INFO  : 1
>
> INFO  :
>
> Status: Running (Hive on Spark job[0])
>
> INFO  : Job Progress Format
>
> CurrentTime StageId_StageAttemptId:
> SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
> [StageCost]
>
> INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1
>
> INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1
>
> 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0: 0(+1)/1
>
> 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0: 1/1
> Finished
>
> Status: Finished successfully in 53.25 seconds
>
> OK
>
> INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0:
> 0(+1)/1
>
> INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0:
> 1/1 Finished
>
> INFO  : Status: Finished successfully in 53.25 seconds
>
> INFO  : Completed executing
> command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
> Time taken: 56.337 seconds
>
> INFO  : OK
>
> +-++---+---+--+
>
> | c0  | c1 |  c2   |  c3   |
>
> +-++---+---+--+
>
> | 1   | 1  | 5.0005E7  | 2.8867513459481288E7  |
>
> +-++---+---+--+
>
> 1 row selected (58.529 seconds)
>
>
>
> 58 seconds first run with cold cache is pretty good
>
>
>
> And let us compare it with running the same query 

How spark depends on Guava

2016-05-22 Thread Todd
Hi,
In the Spark code, the Guava Maven dependency scope is "provided". My question
is: how does Spark depend on Guava at runtime? I looked into
spark-assembly-1.6.1-hadoop2.6.1.jar and didn't find class entries like
com.google.common.base.Preconditions etc...
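
One way to see which Guava-related entries (if any) actually made it into the
assembly is to list the jar entries from a Scala shell, e.g. (just a sketch;
the jar path is the one mentioned above):

import java.util.jar.JarFile
import scala.collection.JavaConverters._

// list entries that look Guava-related, including possibly relocated copies
val jar = new JarFile("spark-assembly-1.6.1-hadoop2.6.1.jar")
jar.entries().asScala
  .map(_.getName)
  .filter(n => n.contains("com/google/common") || n.toLowerCase.contains("guava"))
  .take(50)
  .foreach(println)
jar.close()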


Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Mich Talebzadeh
Hi,



I have done a number of extensive tests using Spark-shell with Hive DB and
ORC tables.



Now, one issue that we typically face is, and I quote:



Spark is fast as it uses memory and DAG. Great, but when we save data it is
not fast enough.

OK, but there is a solution now. If you use Spark with Hive and you are on a
decent version of Hive (>= 0.14), then you can also deploy Spark as the
execution engine for Hive. That will make your application run pretty fast,
as you no longer rely on the old MapReduce engine for Hive. In a nutshell,
you gain speed in both querying and storage.



I have made some comparisons on this set-up and I am sure some of you will
find it useful.



The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.

The version of Hive I use is Hive 2.

The version of Spark I use as the Hive execution engine is 1.3.1. It works, and
frankly Spark 1.3.1 as an execution engine is adequate (until we sort out
the Hadoop libraries mismatch).



As an example, I am using the Hive on Spark engine to find the min and max of
IDs for a table with 1 billion rows:



0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
stddev(id) from oraclehadoop.dummy;

Query ID = hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006





Starting Spark Job = 5e092ef9-d798-4952-b156-74df49da9151



INFO  : Completed compiling
command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
Time taken: 1.911 seconds

INFO  : Executing
command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006):
select min(id), max(id),avg(id), stddev(id) from oraclehadoop.dummy

INFO  : Query ID =
hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006

INFO  : Total jobs = 1

INFO  : Launching Job 1 out of 1

INFO  : Starting task [Stage-1:MAPRED] in serial mode



Query Hive on Spark job[0] stages:

0

1

Status: Running (Hive on Spark job[0])

Job Progress Format

CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]

2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1

2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1

2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1

2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1

INFO  :

Query Hive on Spark job[0] stages:

INFO  : 0

INFO  : 1

INFO  :

Status: Running (Hive on Spark job[0])

INFO  : Job Progress Format

CurrentTime StageId_StageAttemptId:
SucceededTasksCount(+RunningTasksCount-FailedTasksCount)/TotalTasksCount
[StageCost]

INFO  : 2016-05-23 00:21:19,062 Stage-0_0: 0/22 Stage-1_0: 0/1

INFO  : 2016-05-23 00:21:20,070 Stage-0_0: 0(+12)/22Stage-1_0: 0/1

INFO  : 2016-05-23 00:21:23,119 Stage-0_0: 0(+12)/22Stage-1_0: 0/1

INFO  : 2016-05-23 00:21:26,156 Stage-0_0: 13(+9)/22Stage-1_0: 0/1

2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0: 0(+1)/1

2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0: 1/1
Finished

Status: Finished successfully in 53.25 seconds

OK

INFO  : 2016-05-23 00:21:29,181 Stage-0_0: 22/22 Finished   Stage-1_0:
0(+1)/1

INFO  : 2016-05-23 00:21:30,189 Stage-0_0: 22/22 Finished   Stage-1_0:
1/1 Finished

INFO  : Status: Finished successfully in 53.25 seconds

INFO  : Completed executing
command(queryId=hduser_20160523002031_3e22e26e-4293-4e90-ae8b-72fe9683c006);
Time taken: 56.337 seconds

INFO  : OK

+-++---+---+--+

| c0  | c1 |  c2   |  c3   |

+-++---+---+--+

| 1   | 1  | 5.0005E7  | 2.8867513459481288E7  |

+-++---+---+--+

1 row selected (58.529 seconds)



58 seconds for the first run with a cold cache is pretty good.
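
For reference, the same aggregate can also be issued straight from spark-shell
against that Hive table; a minimal sketch with HiveContext:

val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("SELECT MIN(id), MAX(id), AVG(id), STDDEV(id) FROM oraclehadoop.dummy").show()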



And let us compare it with running the same query on the MapReduce engine:



: jdbc:hive2://rhes564:10010/default> set hive.execution.engine=mr;

Hive-on-MR is deprecated in Hive 2 and may not be available in the future
versions. Consider using a different execution engine (i.e. spark, tez) or
using Hive 1.X releases.

No rows affected (0.007 seconds)

0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
stddev(id) from oraclehadoop.dummy;

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
future versions. Consider using a different execution engine (i.e. spark,
tez) or using Hive 1.X releases.

Query ID = hduser_20160523002632_9f91d42a-ea46-4a66-a589-7d39c23b41dc

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

  set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

  set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

  set mapreduce.job.reduces=<number>

Starting Job = job_1463956731753_0005, Tracking URL =

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Whatever you do, the lion's share of the time is going to be taken by the
insert into the Hive table.

OK, check this. It is CSV files inserted into a Hive ORC table. This version
uses the Hive on Spark engine and is written in Hive, executed via beeline.

--1 Move .CSV data into HDFS:
--2 Create an external table.
--3 Create the ORC table.
--4 Insert the data from the external table to the Hive ORC table

select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS
StartTime;
set hive.exec.reducers.max=256;
use accounts;
--set hive.execution.engine=mr;
--2)
DROP TABLE IF EXISTS stg_t2;
CREATE EXTERNAL TABLE stg_t2 (
 INVOICENUMBER string
,PAYMENTDATE string
,NET string
,VAT string
,TOTAL string
)
COMMENT 'from csv file from excel sheet nw_10124772'
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE
LOCATION '/data/stg/accounts/nw/10124772'
TBLPROPERTIES ("skip.header.line.count"="1")
;
--3)
DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (
 INVOICENUMBER  INT
,PAYMENTDATE    DATE
,NET            DECIMAL(20,2)
,VAT            DECIMAL(20,2)
,TOTAL  DECIMAL(20,2)
)
COMMENT 'from csv file from excel sheet nw_10124772'
CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
;
--4) Put data in target table. do the conversion and ignore empty rows
INSERT INTO TABLE t2
SELECT
  INVOICENUMBER
,
TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS paymentdate
--, CAST(REGEXP_REPLACE(SUBSTR(net,2,20),",","") AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(vat,'[^\\d\\.]','') AS DECIMAL(20,2))
, CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
FROM
stg_t2
WHERE
--INVOICENUMBER > 0 AND
CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0
-- Exclude empty rows
;
select from_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') AS EndTime;
!exit

And similar using Spark shell and temp table

import org.apache.spark.sql.functions._
import java.sql.{Date, Timestamp}
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
println ("\nStarted at"); sqlContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
").collect.foreach(println)
//
// Get a DF first based on Databricks CSV libraries ignore column heading
because of column called "Type"
//
val df =
sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
"true").option("header",
"true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
//
//  [Date: string,  Type: string,  Description: string,  Value: double,
Balance: double,  Account Name: string,  Account Number: string]
//
case class Accounts( TransactionDate: String, TransactionType: String,
Description: String, Value: Double, Balance: Double, AccountName: String,
AccountNumber : String)
// Map the columns to names
//
val a = df.filter(col("Date") > "").map(p =>
Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
//
// Create a Spark temporary table
//
a.toDF.registerTempTable("tmp")
//
// Test it here
//
//sql("select TransactionDate, TransactionType, Description, Value,
Balance, AccountName, AccountNumber from tmp").take(2)
//
// Need to create and populate target ORC table nw_10124772 in database
accounts.in Hive
//
sql("use accounts")
//
// Drop and create table nw_10124772
//
sql("DROP TABLE IF EXISTS accounts.nw_10124772")
var sqltext : String = ""
sqltext = """
CREATE TABLE accounts.nw_10124772 (
TransactionDate    DATE
,TransactionType   String
,Description   String
,Value Double
,Balance   Double
,AccountName   String
,AccountNumber Int
)
COMMENT 'from csv file from excel sheet'
STORED AS ORC
TBLPROPERTIES ( "orc.compress"="ZLIB" )
"""
sql(sqltext)
//
// Put data in Hive table. Clean up is already done
//
sqltext = """
INSERT INTO TABLE accounts.nw_10124772
SELECT

TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/yyyy'),'yyyy-MM-dd'))
AS TransactionDate
, TransactionType
, Description
, Value
, Balance
, AccountName
, AccountNumber
FROM tmp
"""
sql(sqltext)
//
// Test all went OK by looking at some old transactions
//
sql("Select TransactionDate, Value, Balance from nw_10124772 where
TransactionDate < '2011-05-30'").collect.foreach(println)
//
println ("\nFinished at"); sqlContext.sql("SELECT
FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
").collect.foreach(println)
sys.exit()


Anyway, worth trying.

HTH



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am currently doing option 1 using the following, and it takes a lot of time.
What's the advantage of doing option 2, and how do I do it?

sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
stored as ORC LOCATION '/user/users' ")
  sqlContext.sql("  orc.compress= SNAPPY")
  sqlContext.sql(
""" from recordsTemp ps   insert overwrite table users
partition(datePartition , idPartition )  select ps.id, ps.record ,
ps.datePartition, ps.idPartition  """.stripMargin)

On Sun, May 22, 2016 at 12:47 PM, Mich Talebzadeh  wrote:

> two alternatives for this ETL or ELT
>
>
>1. There is only one external ORC table and you do insert overwrite
>into that external table through Spark sql
>2. or
>3. 14k files loaded into staging area/read directory and then insert
>overwrite into an ORC table and th
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 20:38, swetha kasireddy 
> wrote:
>
>> Around 14000 partitions need to be loaded every hour. Yes, I tested this
>> and its taking a lot of time to load. A partition would look something like
>> the following which is further partitioned by userId with all the
>> userRecords for that date inside it.
>>
>> 5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12
>>
>> On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> by partition do you mean 14000 files loaded in each batch session (say
>>> daily)?.
>>>
>>> Have you actually tested this?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 22 May 2016 at 20:24, swetha kasireddy 
>>> wrote:
>>>
 The data is not very big. Say 1MB-10 MB at the max per partition. What
 is the best way to insert this 14k partitions with decent performance?

 On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> the acid question is how many rows are you going to insert in a batch
> session? btw if this is purely an sql operation then you can do all that 
> in
> hive running on spark engine. It will be very fast as well.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 20:14, Jörn Franke  wrote:
>
>> 14000 partitions seem to be way too many to be performant (except for
>> large data sets). How much data does one partition contain?
>>
>> > On 22 May 2016, at 09:34, SRK  wrote:
>> >
>> > Hi,
>> >
>> > In my Spark SQL query to insert data, I have around 14,000
>> partitions of
>> > data which seems to be causing memory issues. How can I insert the
>> data for
>> > 100 partitions at a time to avoid any memory issues?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>> > Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >
>> >
>> -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Two alternatives for this ETL or ELT:


   1. There is only one external ORC table and you do an insert overwrite into
   that external table through Spark SQL, or
   2. 14k files are loaded into a staging area/read directory and then insert
   overwrite into an ORC table and th



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 20:38, swetha kasireddy  wrote:

> Around 14000 partitions need to be loaded every hour. Yes, I tested this
> and its taking a lot of time to load. A partition would look something like
> the following which is further partitioned by userId with all the
> userRecords for that date inside it.
>
> 5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12
>
> On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> by partition do you mean 14000 files loaded in each batch session (say
>> daily)?.
>>
>> Have you actually tested this?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 May 2016 at 20:24, swetha kasireddy 
>> wrote:
>>
>>> The data is not very big. Say 1MB-10 MB at the max per partition. What
>>> is the best way to insert this 14k partitions with decent performance?
>>>
>>> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 the acid question is how many rows are you going to insert in a batch
 session? btw if this is purely an sql operation then you can do all that in
 hive running on spark engine. It will be very fast as well.



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 22 May 2016 at 20:14, Jörn Franke  wrote:

> 14000 partitions seem to be way too many to be performant (except for
> large data sets). How much data does one partition contain?
>
> > On 22 May 2016, at 09:34, SRK  wrote:
> >
> > Hi,
> >
> > In my Spark SQL query to insert data, I have around 14,000
> partitions of
> > data which seems to be causing memory issues. How can I insert the
> data for
> > 100 partitions at a time to avoid any memory issues?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
> > Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
Around 14000 partitions need to be loaded every hour. Yes, I tested this,
and it's taking a lot of time to load. A partition would look something like
the following, which is further partitioned by userId with all the
userRecords for that date inside it.

5 2016-05-20 16:03 /user/user/userRecords/dtPartitioner=2012-09-12

On Sun, May 22, 2016 at 12:30 PM, Mich Talebzadeh  wrote:

> by partition do you mean 14000 files loaded in each batch session (say
> daily)?.
>
> Have you actually tested this?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 20:24, swetha kasireddy 
> wrote:
>
>> The data is not very big. Say 1MB-10 MB at the max per partition. What is
>> the best way to insert this 14k partitions with decent performance?
>>
>> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> the acid question is how many rows are you going to insert in a batch
>>> session? btw if this is purely an sql operation then you can do all that in
>>> hive running on spark engine. It will be very fast as well.
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 22 May 2016 at 20:14, Jörn Franke  wrote:
>>>
 14000 partitions seem to be way too many to be performant (except for
 large data sets). How much data does one partition contain?

 > On 22 May 2016, at 09:34, SRK  wrote:
 >
 > Hi,
 >
 > In my Spark SQL query to insert data, I have around 14,000 partitions
 of
 > data which seems to be causing memory issues. How can I insert the
 data for
 > 100 partitions at a time to avoid any memory issues?
 >
 >
 >
 > --
 > View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
 > Sent from the Apache Spark User List mailing list archive at
 Nabble.com.
 >
 > -
 > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 > For additional commands, e-mail: user-h...@spark.apache.org
 >

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
By partition, do you mean 14000 files loaded in each batch session (say
daily)?

Have you actually tested this?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 20:24, swetha kasireddy  wrote:

> The data is not very big. Say 1MB-10 MB at the max per partition. What is
> the best way to insert this 14k partitions with decent performance?
>
> On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> the acid question is how many rows are you going to insert in a batch
>> session? btw if this is purely an sql operation then you can do all that in
>> hive running on spark engine. It will be very fast as well.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 May 2016 at 20:14, Jörn Franke  wrote:
>>
>>> 14000 partitions seem to be way too many to be performant (except for
>>> large data sets). How much data does one partition contain?
>>>
>>> > On 22 May 2016, at 09:34, SRK  wrote:
>>> >
>>> > Hi,
>>> >
>>> > In my Spark SQL query to insert data, I have around 14,000 partitions
>>> of
>>> > data which seems to be causing memory issues. How can I insert the
>>> data for
>>> > 100 partitions at a time to avoid any memory issues?
>>> >
>>> >
>>> >
>>> > --
>>> > View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>>> > Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >
>>> > -
>>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> > For additional commands, e-mail: user-h...@spark.apache.org
>>> >
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
The data is not very big. Say 1 MB to 10 MB at the max per partition. What is
the best way to insert these 14k partitions with decent performance?

On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh  wrote:

> the acid question is how many rows are you going to insert in a batch
> session? btw if this is purely an sql operation then you can do all that in
> hive running on spark engine. It will be very fast as well.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 20:14, Jörn Franke  wrote:
>
>> 14000 partitions seem to be way too many to be performant (except for
>> large data sets). How much data does one partition contain?
>>
>> > On 22 May 2016, at 09:34, SRK  wrote:
>> >
>> > Hi,
>> >
>> > In my Spark SQL query to insert data, I have around 14,000 partitions of
>> > data which seems to be causing memory issues. How can I insert the data
>> for
>> > 100 partitions at a time to avoid any memory issues?
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
The acid question is: how many rows are you going to insert in a batch
session? BTW, if this is purely a SQL operation, then you can do all that in
Hive running on the Spark engine. It will be very fast as well.



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 20:14, Jörn Franke  wrote:

> 14000 partitions seem to be way too many to be performant (except for
> large data sets). How much data does one partition contain?
>
> > On 22 May 2016, at 09:34, SRK  wrote:
> >
> > Hi,
> >
> > In my Spark SQL query to insert data, I have around 14,000 partitions of
> > data which seems to be causing memory issues. How can I insert the data
> for
> > 100 partitions at a time to avoid any memory issues?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Jörn Franke
14000 partitions seem to be way too many to be performant (except for large 
data sets). How much data does one partition contain?

> On 22 May 2016, at 09:34, SRK  wrote:
> 
> Hi,
> 
> In my Spark SQL query to insert data, I have around 14,000 partitions of
> data which seems to be causing memory issues. How can I insert the data for
> 100 partitions at a time to avoid any memory issues? 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Sabarish Sasidharan
Can't you just reduce the amount of data you insert by applying a filter so
that only a small set of idPartitions is selected? You could have multiple
such inserts to cover all idPartitions. Does that help?
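
For example, something along these lines from the Spark side (a sketch only;
the table and column names come from the quoted query below, and the batch
size and loop are illustrative):

// collect the distinct idPartition values, then insert them 100 at a time
val allIdPartitions = sqlContext
  .sql("SELECT DISTINCT idPartition FROM recordsTemp")
  .map(_.getString(0))
  .collect()

allIdPartitions.grouped(100).foreach { batch =>
  val inList = batch.map(p => s"'$p'").mkString(", ")
  sqlContext.sql(s"""
    FROM recordsTemp ps
    INSERT OVERWRITE TABLE users PARTITION (datePartition, idPartition)
    SELECT ps.id, ps.record, ps.datePartition, ps.idPartition
    WHERE ps.idPartition IN ($inList)
  """)
}

With dynamic partitions, each such INSERT OVERWRITE should only replace the
partitions that actually appear in that batch.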

Regards
Sab
On 22 May 2016 1:11 pm, "swetha kasireddy" 
wrote:

> I am looking at ORC. I insert the data using the following query.
>
> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
> record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
> stored as ORC LOCATION '/user/users' ")
>   sqlContext.sql("  orc.compress= SNAPPY")
>   sqlContext.sql(
> """ from recordsTemp ps   insert overwrite table users
> partition(datePartition , idPartition )  select ps.id, ps.record ,
> ps.datePartition, ps.idPartition  """.stripMargin)
>
> On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> where is your base table and what format is it Parquet, ORC etc)
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 May 2016 at 08:34, SRK  wrote:
>>
>>> Hi,
>>>
>>> In my Spark SQL query to insert data, I have around 14,000 partitions of
>>> data which seems to be causing memory issues. How can I insert the data
>>> for
>>> 100 partitions at a time to avoid any memory issues?
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
So, if I put 1000 records at a time, and the next 1000 records have some
records with the same partition as the previous records, then the data will
be overwritten. How can I prevent overwriting valid data in this case?
Could you post the example that you are talking about?

What I am doing is, in the final insert into the ORC table, I
insert/overwrite the data. So, I need to have a way to insert all the data
related to one partition at a time so that it is not overwritten when I
insert the next set of records.

On Sun, May 22, 2016 at 11:51 AM, Mich Talebzadeh  wrote:

> ok is the staging table used as staging only.
>
> you can create a staging *directory^ where you put your data there (you
> can put 100s of files there) and do an insert/select that will take data
> from 100 files into your main ORC table.
>
> I have an example of 100's of CSV files insert/select from a staging
> external table into an ORC table.
>
> My point is you are more likely interested in doing analysis on ORC table
> (read internal) rather than using staging table.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 19:43, swetha kasireddy 
> wrote:
>
>> But, how do I take 100 partitions at a time from staging table?
>>
>> On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> ok so you still keep data as ORC in Hive for further analysis
>>>
>>> what I have in mind is to have an external table as staging table and do
>>> insert into an orc internal table which is bucketed and partitioned.
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 22 May 2016 at 19:11, swetha kasireddy 
>>> wrote:
>>>
 I am looking at ORC. I insert the data using the following query.

 sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id
 STRING,
 record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
 stored as ORC LOCATION '/user/users' ")
   sqlContext.sql("  orc.compress= SNAPPY")
   sqlContext.sql(
 """ from recordsTemp ps   insert overwrite table users
 partition(datePartition , idPartition )  select ps.id, ps.record ,
 ps.datePartition, ps.idPartition  """.stripMargin)

 On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:

> where is your base table and what format is it Parquet, ORC etc)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 08:34, SRK  wrote:
>
>> Hi,
>>
>> In my Spark SQL query to insert data, I have around 14,000 partitions
>> of
>> data which seems to be causing memory issues. How can I insert the
>> data for
>> 100 partitions at a time to avoid any memory issues?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>

>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
OK, is the staging table used as staging only?

You can create a staging *directory* where you put your data (you can
put 100s of files there) and do an insert/select that will take the data from
100 files into your main ORC table.

I have an example of 100s of CSV files insert/selected from a staging
external table into an ORC table.

My point is that you are more likely interested in doing analysis on the ORC
table (read internal) rather than using the staging table.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 19:43, swetha kasireddy  wrote:

> But, how do I take 100 partitions at a time from staging table?
>
> On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> ok so you still keep data as ORC in Hive for further analysis
>>
>> what I have in mind is to have an external table as staging table and do
>> insert into an orc internal table which is bucketed and partitioned.
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 May 2016 at 19:11, swetha kasireddy 
>> wrote:
>>
>>> I am looking at ORC. I insert the data using the following query.
>>>
>>> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
>>> record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
>>> stored as ORC LOCATION '/user/users' ")
>>>   sqlContext.sql("  orc.compress= SNAPPY")
>>>   sqlContext.sql(
>>> """ from recordsTemp ps   insert overwrite table users
>>> partition(datePartition , idPartition )  select ps.id, ps.record ,
>>> ps.datePartition, ps.idPartition  """.stripMargin)
>>>
>>> On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 where is your base table and what format is it Parquet, ORC etc)



 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 22 May 2016 at 08:34, SRK  wrote:

> Hi,
>
> In my Spark SQL query to insert data, I have around 14,000 partitions
> of
> data which seems to be causing memory issues. How can I insert the
> data for
> 100 partitions at a time to avoid any memory issues?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
But how do I take 100 partitions at a time from the staging table?

On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh  wrote:

> ok so you still keep data as ORC in Hive for further analysis
>
> what I have in mind is to have an external table as staging table and do
> insert into an orc internal table which is bucketed and partitioned.
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 19:11, swetha kasireddy 
> wrote:
>
>> I am looking at ORC. I insert the data using the following query.
>>
>> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
>> record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
>> stored as ORC LOCATION '/user/users' ")
>>   sqlContext.sql("  orc.compress= SNAPPY")
>>   sqlContext.sql(
>> """ from recordsTemp ps   insert overwrite table users
>> partition(datePartition , idPartition )  select ps.id, ps.record ,
>> ps.datePartition, ps.idPartition  """.stripMargin)
>>
>> On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> where is your base table and what format is it Parquet, ORC etc)
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 22 May 2016 at 08:34, SRK  wrote:
>>>
 Hi,

 In my Spark SQL query to insert data, I have around 14,000 partitions of
 data which seems to be causing memory issues. How can I insert the data
 for
 100 partitions at a time to avoid any memory issues?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am looking at ORC. I insert the data using the following query.

sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
stored as ORC LOCATION '/user/users' ")
  sqlContext.sql("  orc.compress= SNAPPY")
  sqlContext.sql(
""" from recordsTemp ps   insert overwrite table users
partition(datePartition , idPartition )  select ps.id, ps.record ,
ps.datePartition, ps.idPartition  """.stripMargin)

On Sun, May 22, 2016 at 12:37 AM, Mich Talebzadeh  wrote:

> where is your base table and what format is it Parquet, ORC etc)
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 May 2016 at 08:34, SRK  wrote:
>
>> Hi,
>>
>> In my Spark SQL query to insert data, I have around 14,000 partitions of
>> data which seems to be causing memory issues. How can I insert the data
>> for
>> 100 partitions at a time to avoid any memory issues?
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Hive 2 Metastore Entity-Relationship Diagram, Base tables

2016-05-22 Thread Mich Talebzadeh
For now, it is to be used as a quick reference for Hive metadata tables, columns,
PKs and constraints. It only covers the base tables, excluding the transactional
add-ons in

hive-txn-schema-2.0.0.oracle.sql

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 15:55, Denise Rogers  wrote:

> What capability is this construct designed to support? Reporting, ad hoc
> query? I need a context.
>
> Regards,
> Denise
> Cell - (860)989-3431
>
> Sent from mi iPhone
>
> On May 22, 2016, at 3:36 AM, Mich Talebzadeh 
> wrote:
>
>
> Hi,
>
> This is Hive 2 Entity-Relationship Diagrams created by running the
> following script against Oracle Hive schema
>
> ${HIVE_HOME}/scripts/metastore/upgrade/oracle/hive-schema-2.0.0.oracle.sql
>
> It creates 55 tables, primary key, foreign key and other constraints.
>
> It does not have any views.
>
> I trust that the diagram in PDF format can be useful.
>
> Any overlapping entities etc please comment and I will try to sort it out
> in the next draft.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> 
>
>


Re: Handling Empty RDD

2016-05-22 Thread Ted Yu
You mean when rdd.isEmpty() returned false, saveAsTextFile still produced an
empty file?

Can you show code snippet that demonstrates this ?

Cheers

On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas  wrote:

> Hi,
> I am reading files using textFileStream, performing some action onto
> it and then saving it to HDFS using saveAsTextFile.
> But whenever there is no file to read, Spark will write and empty RDD(
> [] ) to HDFS.
> So, how to handle the empty RDD.
>
> I checked rdd.isEmpty() and rdd.count>0, but both of them does not works.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi,
I am reading files using textFileStream, performing some action on
them and then saving the result to HDFS using saveAsTextFile.
But whenever there is no file to read, Spark will write an empty RDD
( [] ) to HDFS.
So, how do I handle the empty RDD?

I checked rdd.isEmpty() and rdd.count > 0, but neither of them works.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my
use case. I have a large dataset, around 1 TB, which I need to process/update in
a DataFrame. Now my job shuffles huge amounts of data and slows down because of
the shuffling and group by. One reason I see is that my data is skewed: some of
my group-by keys are empty. How do I avoid empty group-by keys in a DataFrame?
Does a DataFrame avoid empty group-by keys? I have around 8 keys on which I do
the group by.

sourceFrame.select("blabla").groupBy("col1","col2","col3", ..., "col8").agg("blabla");

How do I change the above code to use reduceByKey()? Can we apply aggregation
with reduceByKey()? Please guide. Thanks in advance.
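
For what it's worth, a rough sketch of both ideas, i.e. dropping empty keys
before the shuffle and keying the underlying RDD for reduceByKey (the column
names, the "someMeasure" column and the sum are placeholders):

import org.apache.spark.sql.functions._

// drop rows whose grouping keys are empty before the shuffle
val nonEmpty = sourceFrame.filter(length(col("col1")) > 0 && length(col("col2")) > 0)

// reduceByKey equivalent on the underlying RDD: key by the grouping columns and
// combine a single measure; reduceByKey aggregates map-side before shuffling
val reduced = nonEmpty
  .select("col1", "col2", "someMeasure")
  .rdd
  .map(r => ((r.getString(0), r.getString(1)), r.getDouble(2)))
  .reduceByKey(_ + _)

Note that DataFrame aggregations already do partial (map-side) aggregation, so
filtering out the empty/skewed keys is often a bigger win than switching to
reduceByKey.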



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-change-Spark-DataFrame-groupby-col1-coln-into-reduceByKey-tp26998.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Unsubscribe

2016-05-22 Thread Shekhar Kumar
Please Unsubscribe


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi
I am reading a text file with 16 fields.
All the placeholders for the values of this text file have been defined in,
say, 2 different case classes:
Case1 and Case2

How do I map the values read from the text file so that my Scala function can
return 2 different RDDs, with each RDD of one of these 2 different case class
types?
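
A minimal sketch of one way to do it (the comma delimiter, the field positions
and the case class fields are just illustrative; your real classes would carry
the full 16 fields):

import org.apache.spark.rdd.RDD

case class Case1(f1: String, f2: String)   // placeholder fields
case class Case2(f3: String, f4: String)   // placeholder fields

// sc is the SparkContext (as in spark-shell)
def twoRdds(path: String): (RDD[Case1], RDD[Case2]) = {
  // split each line once and reuse the result for both RDDs
  val fields = sc.textFile(path).map(_.split(",", -1)).cache()
  val rdd1 = fields.map(a => Case1(a(0), a(1)))
  val rdd2 = fields.map(a => Case2(a(2), a(3)))
  (rdd1, rdd2)
}

val (first, second) = twoRdds("hdfs:///path/to/input.txt")   // hypothetical path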

-- 
Thanks
Deepak


Re: unsubscribe

2016-05-22 Thread junius zhou
unsubscribe

On Tue, May 17, 2016 at 5:57 PM, aruna jakhmola 
wrote:

>
>


Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Where is your base table, and what format is it (Parquet, ORC, etc.)?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 22 May 2016 at 08:34, SRK  wrote:

> Hi,
>
> In my Spark SQL query to insert data, I have around 14,000 partitions of
> data which seems to be causing memory issues. How can I insert the data for
> 100 partitions at a time to avoid any memory issues?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread SRK
Hi,

In my Spark SQL query to insert data, I have around 14,000 partitions of
data which seems to be causing memory issues. How can I insert the data for
100 partitions at a time to avoid any memory issues? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-for-100-partitions-at-a-time-using-Spark-SQL-tp26997.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-22 Thread Amit Sela
I need to update this ;)
To start with, you could just take a look at branch-2.0.

On Sun, May 22, 2016, 01:23 Ovidiu-Cristian MARCU <
ovidiu-cristian.ma...@inria.fr> wrote:

> Thank you, Amit! I was looking for this kind of information.
>
> I did not fully read your paper, I see in it a TODO with basically the
> same question(s) [1], maybe someone from Spark team (including Databricks)
> will be so kind to send some feedback..
>
> Best,
> Ovidiu
>
> [1] Integrate “Structured Streaming”: //TODO - What (and how) will Spark
> 2.0 support (out-of-order, event-time windows, watermarks, triggers,
> accumulation modes) - how straight forward will it be to integrate with the
> Beam Model ?
>
>
> On 21 May 2016, at 23:00, Sela, Amit  wrote:
>
> It seems I forgot to add the link to the “Technical Vision” paper so there
> it is -
> https://docs.google.com/document/d/1y4qlQinjjrusGWlgq-mYmbxRW2z7-_X5Xax-GG0YsC0/edit?usp=sharing
>
> From: "Sela, Amit" 
> Date: Saturday, May 21, 2016 at 11:52 PM
> To: Ovidiu-Cristian MARCU , "user @spark"
> 
> Cc: Ovidiu Cristian Marcu 
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
>
> This is a “Technical Vision” paper for the Spark runner, which provides
> general guidelines to the future development of Spark’s Beam support as
> part of the Apache Beam (incubating) project.
> This is our JIRA -
> https://issues.apache.org/jira/browse/BEAM/component/12328915/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-summary-panel
>
> Generally, I’m currently working on Datasets integration for Batch (to
> replace RDD) against Spark 1.6, and going towards enhancing Stream
> processing capabilities with Structured Streaming (2.0)
>
> And you’re welcomed to ask those questions at the Apache Beam (incubating)
> mailing list as well ;)
> http://beam.incubator.apache.org/mailing_lists/
>
> Thanks,
> Amit
>
> From: Ovidiu-Cristian MARCU 
> Date: Tuesday, May 17, 2016 at 12:11 AM
> To: "user @spark" 
> Cc: Ovidiu Cristian Marcu 
> Subject: Re: What / Where / When / How questions in Spark 2.0 ?
>
> Could you please consider a short answer regarding the Apache Beam
> Capability Matrix todo’s for future Spark 2.0 release [4]? (some related
> references below [5][6])
>
> Thanks
>
> [4] http://beam.incubator.apache.org/capability-matrix/#cap-full-what
> [5] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
> [6] https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> On 16 May 2016, at 14:18, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
> Hi,
>
> We can see in [2] many interesting (and expected!) improvements (promises)
> like extended SQL support, unified API (DataFrames, DataSets), improved
> engine (Tungsten relates to ideas from modern compilers and MPP databases -
> similar to Flink [3]), structured streaming etc. It seems we somehow assist
> at a smart unification of Big Data analytics (Spark, Flink - best of two
> worlds)!
>
> *How does Spark respond to the missing What/Where/When/How questions
> (capabilities) highlighted in the unified model Beam [1] ?*
>
> Best,
> Ovidiu
>
> [1]
> https://cloud.google.com/blog/big-data/2016/05/why-apache-beam-a-google-perspective
> [2]
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
> [3] http://stratosphere.eu/project/publications/
>
>
>
>
>