Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-04 Thread Isabelle Phan
Thanks all for your replies! I tested both approaches: registering the temp table then executing SQL vs. saving to the HDFS filepath directly. The problem with the second approach is that I am inserting data into a Hive table, so if I create a new partition with this method, Hive metadata is not
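A minimal sketch of registering such a partition with the Hive metastore after writing files directly, assuming a HiveContext and hypothetical table/partition names:

    // tell the metastore about a partition directory written outside of Hive
    sqlContext.sql("ALTER TABLE my_table ADD IF NOT EXISTS PARTITION (dt='2015-12-01')")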

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Michael Armbrust
you might also coalesce to 1 (or some small number) before writing to avoid creating a lot of files in that partition if you know that there is not a ton of data. On Wed, Dec 2, 2015 at 12:59 AM, Rishi Mishra wrote: > As long as all your data is being inserted by Spark ,
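A minimal sketch of Michael's suggestion; the partition column and path are hypothetical:

    import org.apache.spark.sql.SaveMode

    df.coalesce(1)                      // one output file per partition directory
      .write.mode(SaveMode.Append)
      .partitionBy("dt")
      .save("/test/table")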

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-02 Thread Rishi Mishra
As long as all your data is being inserted by Spark, hence using the same hash partitioner, what Fengdong mentioned should work. On Wed, Dec 2, 2015 at 9:32 AM, Fengdong Yu wrote: > Hi > you can try: > > if your table is under location "/test/table/" on HDFS > and has

Re: sparkSQL Load multiple tables

2015-12-02 Thread Jeff Zhang
Do you want to load multiple tables using sql? JdbcRelation can currently only load a single table; it doesn't accept sql as a loading command. On Wed, Dec 2, 2015 at 4:33 PM, censj wrote: > hi Fengdong Yu: > I want to use sqlContext.read.format('jdbc').options( ... ).load()

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Jeff Zhang
I don't think there's an API for that, but I think it is reasonable and helpful for ETL. As a workaround you can first register your dataframe as a temp table, and use sql to insert into the static partition. On Wed, Dec 2, 2015 at 10:50 AM, Isabelle Phan wrote: > Hello, > > Is there
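A minimal sketch of the workaround Jeff describes, assuming a HiveContext and hypothetical table/column names:

    df.registerTempTable("updates")     // expose the DataFrame to SQL
    sqlContext.sql(
      "INSERT INTO TABLE target_table PARTITION (dt='2015-12-01') SELECT col1, col2 FROM updates")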

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi you can try: if your table is under location "/test/table/" on HDFS and has partitions: "/test/table/dt=2012" "/test/table/dt=2013" df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table") > On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote: > >

MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread luohui20001
hi guys, when I am trying to connect hive with spark-sql, I got a problem like below: [root@master spark]# bin/spark-shell --master local[4] log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system

Re: MySQLSyntaxErrorException when connect hive to sparksql

2015-11-26 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/q3RTtCoKmv14Hd1H1=Re+Spark+Hive+max+key+length+is+767+bytes On Thu, Nov 26, 2015 at 5:26 AM, wrote: > hi guys, > > when I am trying to connect hive with spark-sql,I got a problem like > below: > > > [root@master

Getting different DESCRIBE results between SparkSQL and Hive

2015-11-23 Thread YaoPau
'smallint', ''], ['day', 'smallint', '']] In SparkSQL: hc.sql("DESCRIBE pub.inventory_daily").collect() [Row(col_name=u'effective_date', data_type=u'string', comment=u''), Row(col_name=u'listing_skey', data_type=u'int', comment=u''), Row(col_name=u'car_durable_key', data_type=u'int'

Re: Size exceeds Integer.MAX_VALUE (SparkSQL$TreeNodeException: sort, tree) on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: sort, tree:
Sort [net_site#50 ASC,device#6 ASC], true
 Exchange (RangePartitioning 200)
  Project [net_site#50,device#6,total_count#105L,adblock_count#106L,noanalytics_count#107L,unique_nk_count#109L]
   HashOuterJoin

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
You can also put the C* seed node > address in the spark-defaults.conf file under the SPARK_HOME/conf > directory. Then you don’t need to manually SET it for each Beeline session. > > > > Mohammed > > > > *From:* Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] > *Sent:* Thu

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] > *Sent:* Thursday, November 12, 2015 9:12 AM > *To:* Mohammed Guller > *Cc:* user > *Subject:* Re: Cassandra via SparkSQL/Hive JDBC > > > > Mohammed, > > > > That is great. It looks like a perfect scenario. Would

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
e: > >> Did you mean Hive or Spark SQL JDBC/ODBC server? >> >> >> >> Mohammed >> >> >> >> *From:* Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] >> *Sent:* Thursday, November 12, 2015 9:12 AM >> *To:* Mohammed Guller >> *Cc:*

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Did you mean Hive or Spark SQL JDBC/ODBC server? Mohammed From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] Sent: Thursday, November 12, 2015 9:12 AM To: Mohammed Guller Cc: user Subject: Re: Cassandra via SparkSQL/Hive JDBC Mohammed, That is great. It looks like a perfect scenario. Would

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
deck that I > presented at the Cassandra Summit 2015: > > http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark > > > > > > Mohammed > > > > *From:* Bryan [mailto:bryan.jeff...@gmail.com] > *Sent:* Tuesday, November 10, 2015 7:42

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
PM, Mohammed Guller <moham...@glassbeam.com >> > wrote: >> >>> Did you mean Hive or Spark SQL JDBC/ODBC server? >>> >>> >>> >>> Mohammed >>> >>> >>> >>> *From:* Bryan Jeffrey [mailto:bryan.jeff...@gmail.com]

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
to manually SET it for each Beeline session. Mohammed From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] Sent: Thursday, November 12, 2015 10:26 AM To: Mohammed Guller Cc: user Subject: Re: Cassandra via SparkSQL/Hive JDBC Answer: In beeline run the following: SET spark.cassandra.connection.host
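The two options Mohammed describes, with a placeholder seed address:

    -- per Beeline session
    SET spark.cassandra.connection.host=10.0.0.1;

    # or once for all sessions, in SPARK_HOME/conf/spark-defaults.conf
    spark.cassandra.connection.host    10.0.0.1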

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread java8964
Any reason that Spark Cassandra connector won't work for you? Yong To: bryan.jeff...@gmail.com; user@spark.apache.org From: bryan.jeff...@gmail.com Subject: RE: Cassandra via SparkSQL/Hive JDBC Date: Tue, 10 Nov 2015 22:42:13 -0500 Anyone have thoughts or a similar use-case for SparkSQL

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-11 Thread Mohammed Guller
presented at the Cassandra Summit 2015: http://www.slideshare.net/mg007/ad-hoc-analytics-with-cassandra-and-spark Mohammed From: Bryan [mailto:bryan.jeff...@gmail.com] Sent: Tuesday, November 10, 2015 7:42 PM To: Bryan Jeffrey; user Subject: RE: Cassandra via SparkSQL/Hive JDBC Anyone have

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-10 Thread Bryan
Anyone have thoughts or a similar use-case for SparkSQL / Cassandra? Regards, Bryan Jeffrey -Original Message- From: "Bryan Jeffrey" <bryan.jeff...@gmail.com> Sent: 11/4/2015 11:16 AM To: "user" <user@spark.apache.org> Subject: Cassandra via SparkSQL

Re: SparkSQL JDBC to PostGIS

2015-11-05 Thread Mustafa Elbehery
Baghino < stefano.bagh...@radicalbit.io> wrote: > Hi Mustafa, > > are you trying to run geospatial queries on the PostGIS DB with SparkSQL? > Correct me if I'm wrong, but I think SparkSQL itself would need to support > the geospatial extensions in order for this to work. > >

SparkSQL JDBC to PostGIS

2015-11-04 Thread Mustafa Elbehery
Hi Folks, I am trying to connect from SparkShell to a PostGIS database. Simply put, PostGIS is a *spatial* extension for Postgresql, in order to support *geometry* types. Although the JDBC connection from spark works well with Postgresql, it does not with a database on the same server, which supports

Re: SparkSQL JDBC to PostGIS

2015-11-04 Thread Stefano Baghino
Hi Mustafa, are you trying to run geospatial queries on the PostGIS DB with SparkSQL? Correct me if I'm wrong, but I think SparkSQL itself would need to support the geospatial extensions in order for this to work. On Wed, Nov 4, 2015 at 1:46 PM, Mustafa Elbehery <elbeherymust...@gmail.com>

Re: SparkSQL implicit conversion on insert

2015-11-03 Thread Michael Armbrust
Today you have to do an explicit conversion. I'd really like to open up a public UDT interface as part of Spark Datasets (SPARK-) that would allow you to register custom classes with conversions, but this won't happen till Spark 1.7 likely. On Mon, Nov 2, 2015 at 8:40 PM, Bryan Jeffrey
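A minimal sketch of the explicit conversion in the meantime; the case classes are hypothetical:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class Event(name: String, createdAt: DateTime)      // application-side class
    case class EventRow(name: String, createdAt: Timestamp)  // Hive-compatible mirror

    val events = Seq(Event("login", DateTime.now()))
    val df = sqlContext.createDataFrame(
      sc.parallelize(events.map(e => EventRow(e.name, new Timestamp(e.createdAt.getMillis)))))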

SparkSQL implicit conversion on insert

2015-11-02 Thread Bryan Jeffrey
All, I have an object with Joda DateTime fields. I would prefer to continue to use the DateTime in my application. When I am inserting into Hive I need to cast to a Timestamp field (DateTime is not supported). I added an implicit conversion from DateTime to Timestamp - but it does not appear to be

Re: SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-29 Thread Michael Armbrust
It's super cheap. It's just a hashtable stored on the driver. Yes, you can have more than one name for the same DF. On Wed, Oct 28, 2015 at 6:17 PM, Anfernee Xu wrote: > Hi, > > I just want to understand the cost of DataFrame.registerTempTable(String), > is it just a
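A sketch of what that looks like; both names resolve to the same underlying plan (the table names and the "id" column are hypothetical):

    df.registerTempTable("t1")
    df.registerTempTable("t2")   // second name for the same DataFrame
    sqlContext.sql("SELECT a.id FROM t1 a JOIN t2 b ON a.id = b.id")  // self-join through both names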

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
' condition in SparkSQL From: anfernee...@gmail.com To: user@spark.apache.org Hi, I have a pretty large data set(2M entities) in my RDD, the data has already been partitioned by a specific key, the key has a range(type in long), now I want to create a bunch of key buckets, for example, the key has range

RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
Hi, I have a pretty large data set (2M entities) in my RDD. The data has already been partitioned by a specific key; the key has a range (type long). Now I want to create a bunch of key buckets, for example, the key has range 1 -> 100, I will break the whole range into below buckets 1

Re: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread Anfernee Xu
QL partition by this virtual column? > > In this case, the full dataset will be just scanned once. > > Yong > > -- > Date: Thu, 29 Oct 2015 10:51:53 -0700 > Subject: RDD's filter() or using 'where' condition in SparkSQL > From: anfernee...@gmail.c
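A sketch of the virtual-column idea being discussed, assuming a numeric key and a hypothetical bucket width of 100:

    import org.apache.spark.sql.functions._

    val bucketed = df.withColumn("bucket", (col("key") / 100).cast("long"))
    bucketed.registerTempTable("bucketed")
    // one scan of the full dataset, then per-bucket analysis via WHERE/GROUP BY
    sqlContext.sql("SELECT bucket, count(*) FROM bucketed GROUP BY bucket").show()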

RE: RDD's filter() or using 'where' condition in SparkSQL

2015-10-29 Thread java8964
can do whatever analytic function you want. Yong Date: Thu, 29 Oct 2015 12:53:35 -0700 Subject: Re: RDD's filter() or using 'where' condition in SparkSQL From: anfernee...@gmail.com To: java8...@hotmail.com CC: user@spark.apache.org Thanks Yong for your response. Let me see if I can understand what

SparkSQL: What is the cost of DataFrame.registerTempTable(String)? Can I have multiple tables referencing to the same DataFrame?

2015-10-28 Thread Anfernee Xu
Hi, I just want to understand the cost of DataFrame.registerTempTable(String): is it just a trivial operation (like creating an object reference) in the master (Driver) JVM? And can I have multiple tables with different names referencing the same DataFrame? Thanks -- --Anfernee

SparkSQL on hive error

2015-10-27 Thread Anand Nalya
Hi, I have a partitioned table in Hive (Avro) that I can query alright from the hive cli. When using SparkSQL, I'm able to query some of the partitions, but I get an exception on some of the partitions. The query is: sqlContext.sql("select * from myTable where source='http' and date = 20150812")

RE: SparkSQL on hive error

2015-10-27 Thread Cheng, Hao
Hi Anand, can you paste the table creation statement? I’d like to reproduce that locally first. And BTW, which version are you using? Hao From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Tuesday, October 27, 2015 11:35 PM To: spark users Subject: SparkSQL on hive error Hi, I've

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread Yin Huai
he-spark-user-list.1001560.n3.nabble.com/Spark-1-5-1-driver-memory-problems-while-doing-Cross-Validation-do-not-occur-with-1-4-1-td25076.html > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Anyone-feels-sparkSQL-in-spark1-5-1-very

Re: Anyone feels sparkSQL in spark1.5.1 very slow?

2015-10-26 Thread filthysocks
-feels-sparkSQL-in-spark1-5-1-very-slow-tp25154p25204.html

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-12 Thread Xiao Li
Hi Lloyd, Are both runs cold/warm? Memory/cache hits or misses could be a big factor if your application is IO intensive. You need to monitor your system to understand where your bottleneck is. Good luck, Xiao Li

Re: error in sparkSQL 1.5 using count(1) in nested queries

2015-10-09 Thread Michael Armbrust
Thanks for reporting: https://issues.apache.org/jira/browse/SPARK-11032 You can probably work around this by aliasing the count and just doing a filter on that value afterwards. On Thu, Oct 8, 2015 at 8:47 PM, Jeff Thompson < jeffreykeatingthomp...@gmail.com> wrote: > After upgrading from 1.4.1
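A sketch of the aliasing workaround, using the 'people' table from the report (column names hypothetical):

    sqlContext.sql("""
      SELECT * FROM (
        SELECT name, count(1) AS cnt FROM people GROUP BY name
      ) t
      WHERE t.cnt > 0
    """)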

Re: Default size of a datatype in SparkSQL

2015-10-08 Thread Michael Armbrust
It's purely for estimation, when guessing whether it's safe to do a broadcast join. We picked a number that we thought was larger than the common case (it's better to overestimate to avoid OOM). On Wed, Oct 7, 2015 at 10:11 PM, vivek bhaskar wrote: > I want to understand
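That estimate feeds the broadcast-join threshold, which can also be tuned directly; a sketch with an assumed 10 MB limit:

    // tables estimated below this size may be broadcast; set to -1 to disable broadcast joins
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)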

error in sparkSQL 1.5 using count(1) in nested queries

2015-10-08 Thread Jeff Thompson
After upgrading from 1.4.1 to 1.5.1 I found some of my spark SQL queries no longer worked. Seems to be related to using count(1) or count(*) in a nested query. I can reproduce the issue in a pyspark shell with the sample code below. The ‘people’ table is from spark-1.5.1-bin-hadoop2.4/

Re: SparkSQL: First query execution is always slower than subsequent queries

2015-10-07 Thread Michael Armbrust
-dev +user 1). Is that the reason why it's always slow in the first run? Or are there > any other reasons? Apparently it loads data to memory every time so it > shouldn't be something to do with disk reads, should it? > You are probably seeing the effect of the JVM's JIT. The first run is

Default size of a datatype in SparkSQL

2015-10-07 Thread vivek bhaskar
I want to understand what the use of the default size for a given datatype is. The following link mentions that it's for internal size estimation. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/DataType.html The above behavior is also reflected in the code, where the default value seems to be used

Re: performance difference between Thrift server and SparkSQL?

2015-10-05 Thread Jeff Thompson
Thanks for the suggestion. The output from EXPLAIN is indeed equivalent in both sparkSQL and via the Thrift server. I did some more testing. The source of the performance difference is in the way I was triggering the sparkSQL query. I was using .count() instead of .collect(). When I use
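A sketch of the two triggers being compared (table name hypothetical):

    val df = sqlContext.sql("SELECT * FROM my_table WHERE id = '12345'")
    df.count()    // can prune columns, so it reads less data and looks faster
    df.collect()  // materializes full rows on the driver, comparable to what beeline measures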

Re: performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Michael Armbrust
'12345'; > > When I submit the query via beeline & the JDBC thrift server it returns in > 35s > When I submit the exact same query using sparkSQL from a pyspark shell > (sqlContext.sql("SELECT * FROM ")) it returns in 3s. > > Both times are obtained from the s

performance difference between Thrift server and SparkSQL?

2015-10-03 Thread Jeff Thompson
Hi, I'm running a simple SQL query over a ~700 million row table of the form: SELECT * FROM my_table WHERE id = '12345'; When I submit the query via beeline & the JDBC thrift server it returns in 35s. When I submit the exact same query using sparkSQL from a pyspark shell (sqlContext.sql("

Re: SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-02 Thread Michael Armbrust
Once you convert your data to a dataframe (look at spark-csv), try df.write.partitionBy("yyyy", "mm").save("..."). On Thu, Oct 1, 2015 at 4:11 PM, haridass saisriram < haridass.saisri...@gmail.com> wrote: > Hi, > > I am trying to find a simple example to read a data file on HDFS. The > file
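A fuller sketch of that suggestion, assuming the spark-csv package and the year/month column names from the question:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")   // third-party CSV data source
      .option("header", "true")
      .load("hdfs:///path/to/input")

    df.write.partitionBy("yyyy", "mm").save("hdfs:///path/to/output")  // one directory per yyyy/mm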

SparkSQL: Reading data from hdfs and storing into multiple paths

2015-10-01 Thread haridass saisriram
Hi, I am trying to find a simple example to read a data file on HDFS. The file has the following format: a, b, c, yyyy, mm a1,b1,c1,2015,09 a2,b2,c2,2014,08 I would like to read this file and store it in HDFS partitioned by year and month, something like this: /path/to/hdfs/yyyy/mm I want to

How to add sparkSQL into a standalone application

2015-09-17 Thread Cui Lin
[error] val sqlContext = new org.apache.spark.sql.SQLContext(sc) [error] ^ [error] two errors found [error] (compile:compile) Compilation failed So sparksql is not part of the spark core package? I have no issue when testing my code in spark-shell. Thanks for the help! -- Best regards! Lin,Cui

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
t/src/main/scala/TestMain.scala:19: object sql is > not a member of package org.apache.spark > [error] val sqlContext = new org.apache.spark.sql.SQLContext(sc) > [error] ^ > [error] two errors found > [error] (compile:compile) Compilati
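That error usually means the spark-sql artifact is missing from the build; a sketch of the sbt fix, with an assumed Spark version:

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.5.1" % "provided"
    )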

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Michael Armbrust
org.apache.spark.sql.SQLContext; >>> [error] ^ >>> [error] /data/workspace/test/src/main/scala/TestMain.scala:19: object sql >>> is not a member of package org.apache.spark >>> [error] val sqlContext = new org.apache.spark.sql.SQLContext(sc) >>

Re: How to add sparkSQL into a standalone application

2015-09-17 Thread Cui Lin
^ >> [error] /data/workspace/test/src/main/scala/TestMain.scala:19: object sql is >> not a member of package org.apache.spark >> [error] val sqlContext = new org.apache.spark.sql.SQLContext(sc) >> [error] ^

Re: Implement "LIKE" in SparkSQL

2015-09-14 Thread Jorge Sánchez
r versions of Spark, and for the operations that are >> still not supported, it's pretty straightforward to define your own >> UserDefinedFunctions in either Scala or Java (I don't know about other >> languages). >> On Sep 11, 2015 10:26 PM, "liam" <liaml...@gmail

Re: UDAF and UDT with SparkSQL 1.5.0

2015-09-13 Thread jussipekkap
.scala:132) at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719) at com.eaglepeaks.engine.SparkEngine.main(SparkEngine.java:114) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/UDAF-and-UDT-with-

UDAF and UDT with SparkSQL 1.5.0

2015-09-12 Thread jussipekkap
-spark-user-list.1001560.n3.nabble.com/UDAF-and-UDT-with-SparkSQL-1-5-0-tp24670.html

Re: Implement "LIKE" in SparkSQL

2015-09-12 Thread liam
or Java (I don't know about other > languages). > On Sep 11, 2015 10:26 PM, "liam" <liaml...@gmail.com> wrote: > >> Hi, >> >> Imagine this: the value of one column is the substring of another >> column. When using Oracle, I have many ways to do the query, like the >> following statement, but how to do it in SparkSQL, since there is no "concat(), >> instr(), locate()..." >> >> >> select * from table t where t.a like '%'||t.b||'%'; >> >> >> Thanks. >>
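A sketch of the UDF route Richard suggests, with hypothetical table/column names:

    sqlContext.udf.register("str_contains",
      (a: String, b: String) => a != null && b != null && a.contains(b))
    sqlContext.sql("SELECT * FROM t WHERE str_contains(t.a, t.b)")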

sparksql query hive data error

2015-09-11 Thread stark_summer
select * from pokes; command, it will be OK ~, I can not understand~ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sparksql-query-hive-data-error-tp24654.html

Re: Implement "LIKE" in SparkSQL

2015-09-11 Thread Richard Eggert
). On Sep 11, 2015 10:26 PM, "liam" <liaml...@gmail.com> wrote: > Hi, > > Imagine this: the value of one column is the substring of another > column. When using Oracle, I have many ways to do the query, like the > following statement, but how to do it in SparkSQL s

SparkSQL without access to arrays?

2015-09-03 Thread Terry
: Is there an alternative for this to work? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-without-access-to-arrays-tp24572.html

Re: SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-26 Thread storm
is data.schema.asNullable. What's the real reason for this? Why not simply use the existing schema nullable flags? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-saveAsParquetFile-does-not-preserve-AVRO-schema-tp2p24454.html Sent from

SparkSQL problem with IBM BigInsight V3

2015-08-25 Thread java8964
Hi, In our production environment, we have a unique problem related to Spark SQL, and I wonder if anyone can give me some idea what the best way to handle this is. Our production Hadoop cluster is IBM BigInsight Version 3, which comes with Hadoop 2.2.0 and Hive 0.12. Right now, we build spark

Select some data from Hive (SparkSQL) directly using NodeJS

2015-08-25 Thread Phakin Cheangkrachange
Hi, I just wonder if there's any way that I can get some sample data (10-20 rows) out of Spark's Hive using NodeJs? Submitting a spark job to show 20 rows of data in a web page is not good for me. I've set up Spark Thrift Server as shown in the Spark Doc. The server works because I can use *beeline*

SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-25 Thread storm
in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-saveAsParquetFile-does-not-preserve-AVRO-schema-tp2.html

Re: SparkSQL concerning materials

2015-08-23 Thread Michael Armbrust
://spark.apache.org/docs/latest/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know better the architecture, good practices, some internals. Could you advise me some

Re: SparkSQL concerning materials

2015-08-21 Thread Sameer Farooqui
started is the Spark SQL Guide from Apache http://spark.apache.org/docs/latest/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know better the architecture, good

Re: SparkSQL concerning materials

2015-08-21 Thread Dawid Wysakowicz
...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know better the architecture, good practices, some internals. Could you advise me some materials on this matter? Regards Dawid

SparkSQL concerning materials

2015-08-20 Thread Dawid Wysakowicz
Hi, I would like to dip into SparkSQL. Get to know better the architecture, good practices, some internals. Could you advise me some materials on this matter? Regards Dawid

Re: SparkSQL concerning materials

2015-08-20 Thread Muhammad Atif
Hi Dawid The best place to get started is the Spark SQL Guide from Apache http://spark.apache.org/docs/latest/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know

Re: SparkSQL concerning materials

2015-08-20 Thread Ted Yu
/sql-programming-guide.html Regards Muhammad On Thu, Aug 20, 2015 at 5:46 AM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know better the architecture, good practices, some internals. Could you advise me some materials on this matter

Re: SparkSQL concerning materials

2015-08-20 Thread Dhaval Patel
wysakowicz.da...@gmail.com wrote: Hi, I would like to dip into SparkSQL. Get to know better the architecture, good practices, some internals. Could you advise me some materials on this matter? Regards Dawid

Re: Questions about SparkSQL join on not equality conditions

2015-08-11 Thread gen tang
question about sparksql's implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation. So I would like to know how

Re: Questions about SparkSQL join on not equality conditions

2015-08-10 Thread gen tang
implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation. So I would like to know how spark implement

Questions about SparkSQL join on not equality conditions

2015-08-09 Thread gen tang
Hi, I might have a stupid question about sparksql's implementation of join on not equality conditions, for instance condition1 or condition2. In fact, Hive doesn't support such join, as it is very difficult to express such conditions as a map/reduce job. However, sparksql supports such operation

Re: SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
Hi, The issue only seems to happen when trying to access spark via the SparkSQL Thrift Server interface. Does anyone know a fix? james From: Wu, Walt Disney james.c...@disney.com Date: Friday, August 7, 2015 at 12:40 PM To: user@spark.apache.org

SparkSQL: remove jar added by add jar command from dependencies

2015-08-07 Thread Wu, James C.
Hi, I am using Spark SQL to run some queries on a set of avro data. Somehow I am getting this error 0: jdbc:hive2://n7-z01-0a2a1453> select count(*) from flume_test; Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 26.0 failed 4 times, most recent

SparkSQL: add jar blocks all queries

2015-08-07 Thread Wu, James C.
user@spark.apache.org Subject: SparkSQL: remove jar added by add jar command from dependencies Hi, I am using Spark SQL to run some queries on a set of avro data. Somehow I am getting this error 0: jdbc:hive2://n7-z01-0a2a1453

Re: HiveQL to SparkSQL

2015-08-03 Thread Bigdata techguy
Did anybody try to convert HiveQL queries to SparkSQL? If so, would you share the experience, pros and cons please? Thank you. On Thu, Jul 30, 2015 at 10:37 AM, Bigdata techguy bigdatatech...@gmail.com wrote: Thanks Jorn for the response and for the pointer questions to Hive optimization tips

Re: HiveQL to SparkSQL

2015-07-30 Thread Bigdata techguy
is happening, using compression, using the best data types for join columns, denormalizing, etc. I am using Hive version 0.13. The idea behind this POC is to find the strengths of SparkSQL over HiveQL and identify the use cases where SparkSQL can perform better than HiveQL other than

HiveQL to SparkSQL

2015-07-29 Thread Bigdata techguy
Hi All, I have a fairly complex HiveQL data processing job which I am trying to convert to SparkSQL to improve performance. Below is what it does: Select around 100 columns including Aggregates, From a FACT_TABLE, Joined to the summary of the same FACT_TABLE, Joined to 2 smaller DIMENSION tables.

Re: HiveQL to SparkSQL

2015-07-29 Thread Jörn Franke
trying to convert to SparkSQL to improve performance. Below is what it does. Select around 100 columns including Aggregates From a FACT_TABLE Joined to the summary of the same FACT_TABLE Joined to 2 smaller DIMENSION tables. The data processing currently takes around an hour to complete

Controlling output fileSize in SparkSQL

2015-07-27 Thread Tim Smith
Hi, I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum output file size when writing out from SparkSQL? So far, I have tried: --x- import sqlContext.implicits._ sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true) sc.hadoopConfiguration.setLong

Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Hi All, I have data in MongoDb (a few TBs) which I want to migrate to HDFS to do complex query analysis on. The queries are AND queries involving multiple fields. So my question is: in which format should I store the data in HDFS so that processing will be fast for such kinds of queries?

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
Can you provide an example of an AND query? If you do just look-ups you should try HBase/Phoenix; otherwise you can try ORC with storage index and/or compression, but this depends on what your queries look like. On Wed, Jul 22, 2015 at 2:48 PM, Jeetendra Gangele gangele...@gmail.com wrote: HI

Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
I do not think you can put all your queries into the row key without duplicating the data for each query. However, this would be more of a last resort. Have you checked out Phoenix for HBase? This might suit your needs. It makes things much simpler, because it provides SQL on top of HBase. Nevertheless,

Re: Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Queries will be something like: 1. how many users visited a 1 BHK flat in the last hour in a given area 2. how many visitors for flats in a given area 3. list all users who bought a given property in the last 30 days Further it may get more complex, involving multiple parameters in my query. The

RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet Mohammed From: Jeetendra Gangele [mailto:gangele...@gmail.com] Sent: Wednesday, July 22, 2015 5:48 AM To: user Subject: Need help in SparkSQL HI All, I have data in MongoDb(few TBs) which I want to migrate to HDFS to do complex queries analysis on this data.Queries like AND queries

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-16 Thread Okehee Goh
if there is any difference At 2015-07-15 08:10:44, ogoh oke...@gmail.com wrote: Hello, I am using SparkSQL along with ThriftServer so that we can access using Hive queries. With Spark 1.3.1, I can register UDF function. But, Spark 1.4.0 doesn't work for that. The jar of the udf is same. Below is logs: I

SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread ogoh
Hello, I am using SparkSQL along with ThriftServer so that we can access it using Hive queries. With Spark 1.3.1, I can register a UDF function. But Spark 1.4.0 doesn't work for that. The jar of the udf is the same. Below are the logs; I appreciate any advice. == With Spark 1.4 Beeline version 1.4.0

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread Okehee Goh
, Jul 14, 2015 at 8:46 PM, prosp4300 prosp4...@163.com wrote: What's the result of list jar in both 1.3.1 and 1.4.0, please check if there is any difference At 2015-07-15 08:10:44, ogoh oke...@gmail.com wrote: Hello, I am using SparkSQL along with ThriftServer so that we can access using

Do SparkSQL support subquery?

2015-07-13 Thread Louis Hust
Hi all, I am using spark 1.4 and find some SQL is not supported, especially subqueries, such as subqueries in select items, in the where clause, and in predicate conditions. So I want to know whether spark supports subqueries or I am using spark sql the wrong way. If subqueries are not supported, is there a plan

Re: Do SparkSQL support subquery?

2015-07-13 Thread ayan guha
In Jira, it says in progress https://issues.apache.org/jira/browse/SPARK-4226 On Mon, Jul 13, 2015 at 11:10 PM, Louis Hust louis.h...@gmail.com wrote: Hi, all I am using spark 1.4, and find some sql is not support, especially the subquery, such as subquery in select items, in where clause,
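Until SPARK-4226 lands, an IN-style subquery can often be rewritten as a left semi join; a sketch with hypothetical DataFrames a and b:

    // equivalent of: SELECT * FROM a WHERE id IN (SELECT id FROM b)
    val result = a.join(b, a("id") === b("id"), "leftsemi")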

Re: SparkSQL 'describe table' tries to look at all records

2015-07-13 Thread Yana Kadiyska
to query the table. The one you are looking for is df.printSchema() On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from SparkSQL CLI it seems
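A sketch of the metadata-only route Yana points to (table name hypothetical):

    sqlContext.table("my_table").printSchema()  // schema comes from the catalog, no table scan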

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from SparkSQL CLI it seems to try looking at all records at the table (which takes a really long time for big table) instead

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Ted Yu
Which Spark release do you use ? Cheers On Sun, Jul 12, 2015 at 5:03 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from SparkSQL CLI it seems to try looking at all

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread ayan guha
a 'describe table' from SparkSQL CLI it seems to try looking at all records at the table (which takes a really long time for big table) instead of just giving me the metadata of the table. Would appreciate if someone can give me some pointers, thanks! -- Best Regards, Ayan Guha

Re: SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Yin Huai
it will try to query the table. The one you are looking for is df.printSchema() On Mon, Jul 13, 2015 at 10:03 AM, Jerrick Hoang jerrickho...@gmail.com wrote: Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from SparkSQL CLI

SparkSQL 'describe table' tries to look at all records

2015-07-12 Thread Jerrick Hoang
Hi all, I'm new to Spark and this question may be trivial or has already been answered, but when I do a 'describe table' from the SparkSQL CLI it seems to look at all records in the table (which takes a really long time for a big table) instead of just giving me the metadata of the table. Would

Fwd: SparkSQL Postgres balanced partition of DataFrames

2015-07-10 Thread Moises Baly
Hi, I have a very simple setup of SparkSQL connecting to a Postgres DB and I'm trying to get a DataFrame from a table, the DataFrame with a number X of partitions (let's say 2). The code would be the following: Map<String, String> options = new HashMap<String, String>(); options.put("url", DB_URL
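A sketch of the partitioned JDBC read (in Scala; URL, table, and bounds are placeholders). The splits are plain ranges over partitionColumn, so a skewed column yields unbalanced partitions:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:postgresql://dbhost:5432/mydb?user=u&password=p",
      "dbtable"         -> "my_table",
      "partitionColumn" -> "id",        // must be numeric
      "lowerBound"      -> "1",
      "upperBound"      -> "1000000",
      "numPartitions"   -> "2"
    )).load()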

[SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
Hi folks, I just re-wrote a query from using UNION ALL to using WITH ROLLUP and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here is my code: case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i => KeyValue(i,

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread ayan guha
Can you please post result of show()? On 10 Jul 2015 01:00, Yana Kadiyska yana.kadiy...@gmail.com wrote: Hi folks, I just re-wrote a query from using UNION ALL to use with rollup and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
+---+---+---+
|cnt|_c1|grp|
+---+---+---+
|  1| 31|  0|
|  1| 31|  1|
|  1|  4|  0|
|  1|  4|  1|
|  1| 42|  0|
|  1| 42|  1|
|  1| 15|  0|
|  1| 15|  1|
|  1| 26|  0|
|  1| 26|  1|
|  1| 37|  0|
|  1| 10|  0|
|  1| 37|  1|
|  1| 10|  1|
|  1| 48|  0|
|  1| 21|  0|
|  1| 48|  1|
|  1| 21|  1|
|

RE: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Cheng, Hao
Never mind, I’ve created the jira issue at https://issues.apache.org/jira/browse/SPARK-8972. From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Friday, July 10, 2015 9:15 AM To: yana.kadiy...@gmail.com; ayan guha Cc: user Subject: RE: [SparkSQL] Incorrect ROLLUP results Yes, this is a bug, do
