Re: Hive From Spark: Jdbc VS sparkContext

2017-10-13 Thread Kabeer Ahmed
My take on this might sound a bit different. Here are a few points to consider:

1. Going through Hive JDBC means that the application is restricted by the number of 
queries that can be compiled. HS2 can only compile one SQL statement at a time, and a 
badly written query can take a long time just to compile (that time is spent in 
compilation, not in MapReduce). This reduces query throughput, i.e. the number of 
queries you can push through the JDBC connection.

2. Going through Hive JDBC does have the advantage that the HMS service is protected. 
The JIRA https://issues.apache.org/jira/browse/HIVE-13884 protects HMS from crashing: 
retrieving metadata for a Hive table that has thousands (or even millions) of 
partitions can hit the JVM limit on the size of the array holding the retrieved 
metadata, and when that limit is hit HMS crashes. So this guard is good to have to 
protect HMS and the relational database behind it.

Note: the Hive community has proposed moving the metastore database to HBase, which 
scales, but I don't think this will be implemented anytime soon.

3. Going through the SparkContext, Spark interfaces directly with the Hive MetaStore. I 
have put a sequence of the code flow below. The bit I didn't have time to dive into: I 
believe that if the table is really large, i.e. has more partitions than fit in a short 
(32K), then some form of slicing occurs (I didn't have time to find that piece of code, 
but from experience this does seem to happen).

Code flow:
Spark uses Hive External catalog - goo.gl/7CZcDw
HiveClient version of getPartitions is -> goo.gl/ZAEsqQ
HiveClientImpl of getPartitions is: -> goo.gl/msPrr5
The Hive call is made at: -> goo.gl/TB4NFU
ThriftHiveMetastore.java ->  get_partitions_ps_with_auth

A value of -1 is sent within Spark all the way down to the Hive Metastore thrift call, 
so in effect, for large tables, up to 32K partitions are retrieved at a time. This has 
also led to a few HMS crashes, but I am yet to confirm whether this is really the 
cause. (A short sketch of both access routes follows below.)
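
To make the two routes concrete, here is a minimal, untested sketch (assuming a Spark 
2.x shell where `spark` is predefined; host, database and table names are placeholders, 
not anything from this thread):

// Route 1: through HiveServer2 over JDBC. Every statement below has to be
// compiled by HS2 before anything runs (point 1 above).
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://hs2-host:10000/default", "user", "")
val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM some_db.some_table")
while (rs.next()) println(rs.getLong(1))
conn.close()

// Route 2: through the SparkContext/SparkSession. Spark fetches the table and
// partition metadata straight from the HMS (the HiveExternalCatalog flow listed
// below) and runs the query itself.
spark.sql("SELECT count(*) FROM some_db.some_table").show()

// A rough way to gauge how much partition metadata a query would pull from HMS:
println(spark.sql("SHOW PARTITIONS some_db.some_table").count())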


Based on the 3 points above, I would prefer to use the SparkContext. If the cause of 
the crashes is indeed the retrieval of a high number of partitions, then I may opt for 
the JDBC route.

Thanks
Kabeer.


On Fri, 13 Oct 2017 09:22:37 +0200, Nicolas Paris wrote:
>> In case a table has a few
>> million records, it all goes through the driver.
>
> This sounds clear in JDBC mode: the driver gets all the rows and then it
> spreads the RDD over the executors.
>
> I'd say that most use cases deal with SQL to aggregate huge datasets,
> and retrieve a small amount of rows to be then transformed for ML tasks.
> Then using JDBC offers the robustness of HIVE to produce a small aggregated
> dataset into Spark, while using SPARK SQL uses RDDs to produce the small
> one from the huge one.
>
> Not very clear how SPARK SQL deals with a huge HIVE table. Does it load
> everything into memory and crash, or does this never happen?
>
>


--
Sent using Dekko from my Ubuntu device




Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-20 Thread Kabeer Ahmed
Thank you Takeshi. As far as I can see from the code you pointed to, the default number 
of bytes to pack into a partition is set to 128MB, the parquet block size.

Daniel, it seems you do have a need to modify the number of bytes you pack per 
partition. I am curious to know the scenario; please share it if you can.

Thanks,
Kabeer.
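
PS: for anyone looking for the knob itself, a minimal sketch (Spark 2.x; the input path 
is just a placeholder):

// spark.sql.files.maxPartitionBytes controls how many bytes get packed into one
// partition when reading file-based sources (default 128MB, i.e. 134217728).
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)  // e.g. 64MB
val df = spark.read.parquet("hdfs:///path/to/many-small-files/")
println(df.rdd.getNumPartitions)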

On May 20 2017, at 4:54 pm, Takeshi Yamamuro  wrote:


I think this document points to the logic here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
This logic merges small files into a partition, and you can control the threshold via `spark.sql.files.maxPartitionBytes`.

// maropu

On Sat, May 20, 2017 at 8:15 AM, ayan guha wrote:
I think, like all other read operations, it is driven by the input format used, and I think some variation of combine file input format is used by default. I think you can test it by forcing a particular input format that gives one file per split; then you should end up with the same number of partitions as your data files.

On Sat, 20 May 2017 at 5:12 am, Aakash Basu wrote:
Hey all,
A reply on this would be great!
Thanks,
A.B.

On 17-May-2017 1:43 AM, "Daniel Siegmann" wrote:
When using spark.read on a large number of small files, these are automatically coalesced into fewer partitions. The only documentation I can find on this is in the Spark 2.0.0 release notes, where it simply says (http://spark.apache.org/releases/spark-release-2-0-0.html):

"Automatic file coalescing for native data sources"

Can anyone point me to documentation explaining what triggers this feature, how it decides how many partitions to coalesce to, and what counts as a "native data source"? I couldn't find any mention of this feature in the SQL Programming Guide and Google was not helpful.

--
Daniel Siegmann
Senior Software Engineer
SecurityScorecard Inc.
214 W 29th Street, 5th Floor
New York, NY 10001


--
Best Regards,
Ayan Guha

---
Takeshi Yamamuro



  




Re: Setting conf options in jupyter

2016-10-02 Thread Kabeer Ahmed

William:

Try something along the lines below; it should get rid of the error you reported. (A 
variant with your two settings folded in is sketched after the paste block.)

HTH,

scala> sc.stop

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark._

val sc = new SparkContext(
      new SparkConf().setAppName("bar")
      )
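
And the same pattern with the two settings from your snippet folded in (the values are 
the ones from your post; untested):

import org.apache.spark._

val sc = new SparkContext(
      new SparkConf()
        .setAppName("bar")
        .set("spark.yarn.executor.memoryOverhead", "4096")
        .set("spark.kryoserializer.buffer.max.mb", "1024")
      )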

 

Sent: Thursday, September 29, 2016 at 9:23 PM
From: "William Kupersanin" 
To: user@spark.apache.org
Subject: Setting conf options in jupyter


Hello,

I am trying to figure out how to correctly set config options in jupyter when I am 
already provided a SparkContext and a HiveContext. I need to increase a couple of 
memory allocations. My program dies, indicating that I am trying to call methods on a 
stopped SparkContext. I thought I had created a new one with the new conf, so I am not 
sure why.

My code is as follows:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import SQLContext
import time  # needed for time.time() below

conf = (SparkConf()
        .set("spark.yarn.executor.memoryOverhead", "4096")
        .set("spark.kryoserializer.buffer.max.mb", "1024"))

sc.stop()
sc = SparkContext(conf=conf)
sqlContext2 = SQLContext.getOrCreate(sc)

starttime = time.time()
sampledate = "20160913"
networkdf = sqlContext2.read.json("/sp/network/" + sampledate + "/03/*")

 

 


An error occurred while calling o144.json.

: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

This stopped SparkContext was created at:










Re: Spark to HBase Fast Bulk Upload

2016-09-19 Thread Kabeer Ahmed
Hi,

Without using Spark there are a couple of options; you can refer to this link:
http://blog.cloudera.com/blog/2013/09/how-to-use-hbase-bulk-loading-and-why/

The gist is that you convert the data into HFiles and use the bulk-load option to get 
the data into HBase quickly. (A sketch of driving the same pattern from Spark follows 
below.)
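
A minimal, untested sketch of that pattern driven from Spark (HBase 1.x APIs; the 
table name "my_table", column family "cf", the paths and the comma-separated input 
format are all assumptions for illustration):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()

// Build (rowkey, KeyValue) pairs, sorted by row key as HFiles require.
val kvs = sc.textFile("hdfs:///input/records")            // one "key,value" per line
  .map(_.split(",", 2)).map { case Array(k, v) => (k, v) }
  .sortByKey()
  .map { case (k, v) =>
    val row = Bytes.toBytes(k)
    (new ImmutableBytesWritable(row),
     new KeyValue(row, Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(v)))
  }

// Write HFiles to a staging directory...
// (For a pre-split table you would normally also set up the output via
// HFileOutputFormat2.configureIncrementalLoad on a Hadoop Job so that HFile
// boundaries line up with region boundaries.)
kvs.saveAsNewAPIHadoopFile(
  "hdfs:///staging/hfiles",
  classOf[ImmutableBytesWritable], classOf[KeyValue],
  classOf[HFileOutputFormat2], hbaseConf)

// ...and hand them to HBase with the bulk loader.
val conn = ConnectionFactory.createConnection(hbaseConf)
val tableName = TableName.valueOf("my_table")
new LoadIncrementalHFiles(hbaseConf).doBulkLoad(
  new Path("hdfs:///staging/hfiles"),
  conn.getAdmin, conn.getTable(tableName), conn.getRegionLocator(tableName))
conn.close()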

HTH
Kabeer.

On Mon, 19 Sep, 2016 at 12:59 PM, Punit Naik  wrote:
Hi Guys

I have a huge dataset (~ 1TB) which has about a billion records. I have to 
transfer it to an HBase table. What is the fastest way of doing it?

--
Thank You

Regards

Punit Naik




Re: read parquetfile in spark-sql error

2016-07-25 Thread Kabeer Ahmed
I hope the below sample helps you:

val parquetDF = hiveContext.read.parquet("hdfs://.parquet")
parquetDF.registerTempTable("parquetTable")
sql("SELECT * FROM parquetTable").collect().foreach(println)

Kabeer.
Sent from Nylas N1, the extensible, open source mail client.

On Jul 25 2016, at 12:09 pm, cj <124411...@qq.com> wrote:
hi,all:

  I use spark1.6.1 as my work env.

  when I saved the following content as test1.sql file :


CREATE TEMPORARY TABLE parquetTable

USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)

SELECT * FROM parquetTable

and used bin/spark-sql to run it (/home/bae/dataplatform/spark-1.6.1/bin/spark-sql 
--properties-file ./spark-dataplatform.conf -f test1.sql), I encountered a grammar 
error.


SET hive.support.sql11.reserved.keywords=false
SET spark.sql.hive.version=1.2.1
SET spark.sql.hive.version=1.2.1
NoViableAltException(280@[192:1: tableName : (db= identifier DOT tab= 
identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) 
);])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:116)
at 
org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45918)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:5029)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2640)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1650)
at 
org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1109)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at 
org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.spark.sql.hive.HiveQl$.getAst(HiveQl.scala:276)
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:303)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:41)
at 
org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:40)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at org.apache.spark.sql.hive.HiveQl$.parseSql(HiveQl.scala:295)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
at 
org.apache.spark.sql.hive.HiveQLDialect$$anonfun$parse$1.apply(HiveContext.scala:66)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:290)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:237)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:236)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:279)
at org.apache.spark.sql.hive.HiveQLDialect.parse(HiveContext.scala:65)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$

Re: write and call UDF in spark dataframe

2016-07-21 Thread Kabeer Ahmed
Divya:

https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html

The link gives a complete example of registering a UDAF (user-defined aggregate 
function), and it should also give you a good idea of how to register a plain UDF. If 
you still need a hand, let us know. (A short sketch follows below.)
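
For a plain (non-aggregate) UDF, a minimal sketch along the same lines (Spark 1.x API; 
the function, column, table and `df` DataFrame names are just illustrative, and 
java.sql.Date.valueOf expects strings in yyyy-MM-dd form):

import java.sql.Date
import org.apache.spark.sql.functions.{col, udf}

// SQL-style registration and use:
sqlContext.udf.register("changeToDate", (word: String) => Date.valueOf(word))
sqlContext.sql("SELECT changeToDate(date_string_col) FROM some_table")

// The same function through the DataFrame API:
val changeToDate = udf((word: String) => Date.valueOf(word))
val withDate = df.withColumn("as_date", changeToDate(col("date_string_col")))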

Thanks
Kabeer.

Sent from Nylas N1, the extensible, open source mail client.

On Jul 21 2016, at 8:13 am, Jacek Laskowski  wrote:

On Thu, Jul 21, 2016 at 5:53 AM, Mich Talebzadeh
 wrote:
> something similar

Is this going to be in Scala?

> def ChangeToDate (word : String) : Date = {
> //return
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word,"dd/MM/yyyy"),"yyyy-MM-dd"))
> val d1 = Date.valueOf(ReverseDate(word))
> return d1
> }
> sqlContext.udf.register("ChangeToDate", ChangeToDate(_:String))

then...please use lowercase method names and *no* return please ;-)

BTW, no sqlContext as of Spark 2.0. Sorry.../me smiling nicely

Jacek



Re: Adding hive context gives error

2016-03-07 Thread Kabeer Ahmed
I use SBT and I have never included spark-sql. The two simple lines in my SBT build are 
as below:

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.0",
  "org.apache.spark" %% "spark-hive" % "1.5.0"
)



However, I do note that you are including spark-sql and that the Spark version you use 
is 1.6.0. Can you please try 1.5.0 to see if it works? I haven't yet tried Spark 1.6.0. 
(A version-aligned SBT sketch follows below.)
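
For reference, a minimal sketch of an SBT build that keeps every Spark module on one 
version (1.6.0 here simply mirrors your setup and is untested on my side):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-sql"  % "1.6.0",
  "org.apache.spark" %% "spark-hive" % "1.6.0"
)

A "bad symbolic reference ... org.apache.spark.sql.execution" error usually points at 
Spark modules of different versions (or different Scala binary versions) ending up on 
the same classpath, so the main thing is that spark-core, spark-sql and spark-hive all 
line up.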


On 08/03/16 00:15, Suniti Singh wrote:
Hi All,

I am trying to create a hive context in a scala prog as follows in eclipse:
Note -- I have added the Maven dependencies for spark-core, hive, and sql.


import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions

object DataExp {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setAppName("DataExp").setMaster("local")
    val sc = new SparkContext(conf)
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  }
}

I get the following errors at the hiveContext line in the program above:

1 --- Error in Scala compiler: bad symbolic reference. A signature in 
HiveContext.class refers to term ui in package org.apache.spark.sql.execution 
which is not available. It may be completely missing from the current 
classpath, or the version on the classpath might be incompatible with the 
version used when compiling HiveContext.class. spark Unknown Scala Problem

2 --- SBT builder crashed while compiling. The error message is 'bad symbolic 
reference. A signature in HiveContext.class refers to term ui in package 
org.apache.spark.sql.execution which is not available. It may be completely 
missing from the current classpath, or the version on the classpath might be 
incompatible with the version used when compiling HiveContext.class.'. Check 
Error Log for details. spark Unknown Scala Problem

3 --- while compiling: 
/Users/sunitisingh/sparktest/spark/src/main/scala/com/sparktest/spark/DataExp.scala
 during phase: erasure  library version: version 2.10.6 
compiler version: version 2.10.6   reconstructed args: -javabootclasspath 
/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/resources.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/rt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/jsse.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/jce.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/charsets.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/jfr.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/cldrdata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/dnsns.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/jaccess.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/jfxrt.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/localedata.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.
 
jdk/Contents/Home/jre/lib/ext/nashorn.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/sunec.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/sunjce_provider.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/sunpkcs11.jar:/Library/Java/JavaVirtualMachines/jdk1.8.0_60.jdk/Contents/Home/jre/lib/ext/zipfs.jar:/System/Library/Java/Extensions/AppleScriptEngine.jar:/System/Library/Java/Extensions/dns_sd.jar:/System/Library/Java/Extensions/j3daudio.jar:/System/Library/Java/Extensions/j3dcore.jar:/System/Library/Java/Extensions/j3dutils.jar:/System/Library/Java/Extensions/jai_codec.jar:/System/Library/Java/Extensions/jai_core.jar:/System/Library/Java/Extensions/mlibwrapper_jai.jar:/System/Library/Java/Extensions/MRJToolkit.jar:/System/Library/Java/Extensions/vecmath.jar
 -classpath 
/Users/sunitisingh/sparktest/spark/target/classes:/Users/sunitisingh/sparktest/spark/target/test-classes:/Users/sunitisingh/.m2/repository/org/apache/spark/spark-core_2.10/1.6.0/spark-core_2.10-1.6.0.jar:/Users/sunitisingh/.m2/repository/org/apache/avro/avro-mapred/1.7.7/avro-mapred-1.7.7-hadoop2.jar:/Users/sunitisingh/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.7.jar:/Users/sunitisingh/.m2/repository/org/apache/avro/avro-ipc/1.7.7/avro-ipc-1.7.7-tests.jar:/Users/sunitisingh/.m2/repository/org/codehaus/jackson/jackson-core-asl/1.9.13/jackson-core-asl-1.9.13.jar:/Users/sunitisingh/.m2/repository/com/twitter/chill_2.10/0.5.0/chill_2.10-0.5.0.jar:/Users/sunitisingh/.m2/repository/com/esotericsoftware/kryo/kryo/2.21/kryo-2.21.jar:/Users/sunitisingh/.m2/repository/com/esotericsoftware/reflectasm/reflectasm/1.07/reflectasm-1.07-shaded.jar:/Users/sunitisingh/.m2/repository/com/esotericsoftware/minlog/minlog/1.2/minlog-1.2.jar:/Users/sunitisingh/.m2/repository/org/objenesis/obj
 
enesis/1.2/objenes

Re: UDAF support for DataFrames in Spark 1.5.0?

2016-02-18 Thread Kabeer Ahmed
I use Spark 1.5 with CDH5.5 distribution and I see that support is present for 
UDAF. From the link: 
https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html,
 I read that this is an experimental feature. So it makes sense not to find 
this in the documentation.

To confirm whether it works in Spark 1.5, I quickly tried the example from the link and 
it works. I hope this answers your question. (A condensed sketch follows below.)
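
For reference, a condensed sketch in the spirit of the example in that post (a simple 
sum over a LongType column; the table and column names at the end are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// A UDAF that sums a LongType column (Spark 1.5+).
class LongSum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true
  def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = 0L
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

sqlContext.udf.register("longSum", new LongSum)
sqlContext.sql("SELECT group_col, longSum(value_col) FROM some_table GROUP BY group_col")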

Kabeer.

On 18/02/16 16:31, Richard Cobbe wrote:

I'm working on an application using DataFrames (Scala API) in Spark 1.5.0,
and we need to define and use several custom aggregators.  I'm having
trouble figuring out how to do this, however.

First, which version of Spark did UDAF support land in?  Has it in fact
landed at all?

https://issues.apache.org/jira/browse/SPARK-3947 suggests that UDAFs should
be available in 1.5.0.  However, the associated pull request includes
classes like org.apache.spark.sql.UDAFRegistration, but these classes don't
appear in the API docs, and I'm not able to use them from the spark shell
("type UDAFRegistration is not a member of package org.apache.spark.sql").

I don't have access to a Spark 1.6.0 installation, but UDAFRegistration
doesn't appear in the Scaladoc pages for 1.6.

Second, assuming that this functionality is supported in some version of
Spark, could someone point me to some documentation or an example that
demonstrates how to define and use a custom aggregation function?

Many thanks,

Richard
