Bizarre diff in behavior between Scala REPL and SparkSQL UDF

2017-06-20 Thread jeff saremi
I have this function which does regex matching in Scala. When I test it in the REPL I get the expected results; when I use it as a UDF in SparkSQL I get completely incorrect results. Function: class UrlFilter (filters: Seq[String]) extends Serializable { val regexFilters = filters.map(new Regex
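
For context, a minimal sketch of how such a class is typically wired up as a SparkSQL UDF, assuming a Spark 2.x SparkSession named spark; the pattern, table and column names below are assumptions, not the poster's actual code:

    import scala.util.matching.Regex

    class UrlFilter(filters: Seq[String]) extends Serializable {
      val regexFilters: Seq[Regex] = filters.map(new Regex(_))
      def matches(url: String): Boolean = regexFilters.exists(_.findFirstIn(url).isDefined)
    }

    val filter = new UrlFilter(Seq("""^https?://[^/]*\.example\.com/.*"""))  // assumed pattern
    spark.udf.register("urlFilter", (url: String) => filter.matches(url))
    // spark.sql("SELECT url FROM logs WHERE urlFilter(url)")  // 'logs' and 'url' are assumed names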

RE: [SparkSQL] Escaping a query for a dataframe query

2017-06-16 Thread mark.jenki...@baesystems.com
e$.error(package.scala:27) at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36) From: Gourav Sengupta [mailto:gourav.sengu...@gmail.com] Sent: 15 June 2017 19:35 To: Michael Mior Cc: Jenkins, Mark (UK Guildford); user@spark.apache.org Subject: Re: [SparkSQL] Escaping a qu

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Gourav Sengupta
It might be something that I am saying wrong, but sometimes it may just make sense to see the difference between the curly quote ” (decimal 8221, hex 201D, octal 20035) and the straight quote " (decimal 34, hex 22, octal 042). Regards, Gourav On Thu, Jun 15, 2017 at 6:45 PM, Michael Mior wrote: > Assuming the

Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Michael Mior
Assuming the parameter to your UDF should be start"end (with a quote in the middle) then you need to insert a backslash into the query (which must also be escaped in your code). So just add two extra backslashes before the quote inside the string. sqlContext.sql("SELECT * FROM mytable WHERE
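
A sketch of that suggestion against the query from the original post, assuming the intended argument is start"end; in Scala source, \\\" yields the two characters \" in the SQL text, which the SQL parser reads as an escaped quote inside the string literal:

    // Scala escaping: \" -> " and \\ -> \ , so \"start\\\"end\" becomes "start\"end" in the SQL text
    sqlContext.sql(
      "SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND myudfsearchfor(\"start\\\"end\")")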

[SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread mark.jenki...@baesystems.com
Hi, I have a query sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\"end\"))" How should I escape the double quote so that it successfully parses? I know I can use single quotes but I do not want to since I may need to search for a single

Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Bajpai, Amit X. -ND
set spark.sql.hive.verifyPartitionPath=true didn’t help. Still getting the same error. I tried to copy a file with a _ prefix and I am not getting the error and the file is also ignored by SparkSQL. But when scheduling the job in prod and if during one execution there is no data

Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Sea
ark.apache.org"<user@spark.apache.org>; Subject: Re: SparkSQL not able to read a empty table location On 20 May 2017, at 01:44, Bajpai, Amit X. -ND <amit.x.bajpai@disney.com> wrote: Hi, I have a hive external table with the S3 location having no files (but the

Re: SparkSQL not able to read an empty table location

2017-05-20 Thread Steve Loughran
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND > wrote: Hi, I have a hive external table with the S3 location having no files (but the S3 location directory does exist). When I am trying to use Spark SQL to count the number of records in

SparkSQL not able to read an empty table location

2017-05-19 Thread Bajpai, Amit X. -ND
Hi, I have a hive external table with the S3 location having no files (but the S3 location directory does exist). When I am trying to use Spark SQL to count the number of records in the table it is throwing an error saying “File s3n://data/xyz does not exist. null/0”. select * from tablex limit

Re: How can I merge multiple rows into one row in sparksql or hivesql?

2017-05-15 Thread Edward Capriolo
Here is a similar, though not exact, way I did something like what you did. I had two data files in different formats, and the different columns needed to become different features. I wanted to feed them into Spark's:

Re: How can I merge multiple rows into one row in sparksql or hivesql?

2017-05-15 Thread ayan guha
You may consider writing all your data to a nosql datastore such as hbase, using user id as key. There is a sql solution using max and inner case and finally union the results, but that may be expensive On Tue, 16 May 2017 at 12:13 am, Didac Gil wrote: > Or maybe you

Re: How can I merge multiple rows into one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
Or maybe you could also check using the collect_list from the SQL functions val compacter = Data1.groupBy("UserID") .agg(org.apache.spark.sql.functions.collect_list("feature").as("ListOfFeatures")) > On 15 May 2017, at 15:15, Jone Zhang wrote: > > For example >
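
A cleaned-up sketch of that suggestion, extended to the three datasets from the original question on the assumption that they all share the same two-column (UserID, feature) layout:

    import org.apache.spark.sql.functions.collect_list

    val all = Data1.union(Data2).union(Data3)          // unionAll on Spark 1.6
    val compacter = all
      .groupBy("UserID")
      .agg(collect_list("feature").as("ListOfFeatures"))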

Re: How can I merge multiple rows into one row in sparksql or hivesql?

2017-05-15 Thread Didac Gil
I guess that if your user_id field is the key, you could use the updateStateByKey function. I did not test it, but it could be something along these lines: def yourCombineFunction(input: Seq[(String)], accumulatedInput: Option[(String)]) = { val state = accumulatedInput.getOrElse((""))

How can I merge multiple rows into one row in sparksql or hivesql?

2017-05-15 Thread Jone Zhang
For example Data1(has 1 billion records) user_id1 feature1 user_id1 feature2 Data2(has 1 billion records) user_id1 feature3 Data3(has 1 billion records) user_id1 feature4 user_id1 feature5 ... user_id1 feature100 I want to get the result as follows: user_id1 feature1 feature2 feature3

Feasability limits of joins in SparkSQL (Why does my driver explode with a large number of joins?)

2017-04-11 Thread Rick Moritz
Hi List, I'm currently trying to naively implement a Data-Vault-type Data-Warehouse using SparkSQL, and was wondering whether there's an inherent practical limit to query complexity, beyond which SparkSQL will stop functioning, even for relatively small amounts of data. I'm currently looking

[SparkSQL] Project using NamedExpression

2017-03-21 Thread Aviral Agarwal
Hi guys, I want to transform a Row using NamedExpression. Below is the code snippet that I am using: def apply(dataFrame: DataFrame, selectExpressions: java.util.List[String]): RDD[UnsafeRow] = { val exprArray = selectExpressions.map(s => Column(SqlParser.parseExpression(s)).named )

Re: [SparkSQL] too many open files although ulimit set to 1048576

2017-03-13 Thread darin
I think your settings are not taking effect. Try adding `ulimit -n 10240` in spark-env.sh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-too-many-open-files-although-ulimit-set-to-1048576-tp28490p28491.html Sent from the Apache Spark User List mailing list archive

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
apila.pl>; user <user@spark.apache.org> Subject: RE: [SparkSQL] pre-check syntax before running spark job? Hi, you can use the Spark SQL Antlr grammar to pre-check your syntax. https://github.com/apache/spark/blob/acf71c63cdde8dced8d108260cdd35e1cc992248/sql/catalyst/src/main/antlr4/org/apache/spark/sql/cat

RE: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Gurdit Singh
, 2017 7:34 AM To: Irving Duran <irving.du...@gmail.com>; Yong Zhang <java8...@hotmail.com> Cc: Jacek Laskowski <ja...@japila.pl>; user <user@spark.apache.org> Subject: Re: [SparkSQL] pre-check syntax before running spark job? Actually, I want a standalone jar so I can check

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Linyuxin
m>; user <user@spark.apache.org> Subject: Re: [SparkSQL] pre-check syntax before running spark job? You can also run it in the REPL and test to see if you are getting the expected result. Thank You, Irving Duran On Tue, Feb 21, 2017 at 8:01 AM, Yong Zhang <java8...@hotmail.com<mailto:java8...

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Irving Duran
. > > > Yong > > > -- > *From:* Jacek Laskowski <ja...@japila.pl> > *Sent:* Tuesday, February 21, 2017 4:34 AM > *To:* Linyuxin > *Cc:* user > *Subject:* Re: [SparkSQL] pre-check syntex before running spark job? > > Hi, > > Never he

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Yong Zhang
You can always use the explain method to validate your DF or SQL, before any action. Yong From: Jacek Laskowski <ja...@japila.pl> Sent: Tuesday, February 21, 2017 4:34 AM To: Linyuxin Cc: user Subject: Re: [SparkSQL] pre-check syntax before running spark jo
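
A small sketch of that check, assuming a Spark 2.x SparkSession named spark: sql() parses and analyzes the statement eagerly and explain() prints the plan, so syntax and resolution errors surface without running a job:

    import scala.util.{Failure, Success, Try}

    val query = "SELECT id, count(*) FROM events GROUP BY id"   // stand-in query
    Try(spark.sql(query).explain(true)) match {
      case Success(_) => println("query parses and analyzes")
      case Failure(e) => println(s"invalid query: ${e.getMessage}")
    }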

Re: [SparkSQL] pre-check syntax before running spark job?

2017-02-21 Thread Jacek Laskowski
Hi, Never heard about such a tool before. You could use Antlr to parse SQLs (just as Spark SQL does while parsing queries). I think it's a one-hour project. Jacek On 21 Feb 2017 4:44 a.m., "Linyuxin" wrote: Hi All, Is there any tool/api to check the sql syntax without
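
A minimal parse-only sketch along those lines, assuming Spark 2.x where the Catalyst parser is exposed as org.apache.spark.sql.catalyst.parser.CatalystSqlParser; it is an internal API, so treat this as an illustration rather than a supported tool:

    import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}

    def checkSyntax(sql: String): Either[String, Unit] =
      try { CatalystSqlParser.parsePlan(sql); Right(()) }
      catch { case e: ParseException => Left(e.getMessage) }

    checkSyntax("SELECT * FROM t WHERE")   // Left(parse error), no Spark job is started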

[SparkSQL] pre-check syntax before running spark job?

2017-02-20 Thread Linyuxin
Hi All, Is there any tool/API to check the SQL syntax without actually running a Spark job? Like SiddhiQL on Storm here: SiddhiManagerService.validateExecutionPlan https://github.com/wso2/siddhi/blob/master/modules/siddhi-core/src/main/java/org/wso2/siddhi/core/SiddhiManagerService.java it

Is RAND() in SparkSQL deterministic when used on MySql data sources?

2017-01-12 Thread Gabriele Del Prete
indeed safely assume that now rand() will be deterministic, or does the source of non-deterministic behavior lie in the Spark SQL engine rather than the specific datasource ? Gabriele -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-RAND-in-SparkSQL

Re: Nested ifs in sparksql

2017-01-11 Thread Raghavendra Pandey
I am not using case when. It is mostly IF. By slow, I mean 6 min even for 10 records for 41 level nested ifs. On Jan 11, 2017 3:31 PM, "Georg Heiler" wrote: > I was using the dataframe api not sql. The main problem was that too much > code was generated. > Using an

Re: Nested ifs in sparksql

2017-01-11 Thread Georg Heiler
I was using the dataframe API, not SQL. The main problem was that too much code was generated. Using a UDF turned out to be quicker as well. Olivier Girardot wrote on Tue., 10 Jan 2017 at 21:54: > Are you using the "case when" functions ? what do you

Re: Nested ifs in sparksql

2017-01-10 Thread Olivier Girardot
Are you using the "case when" functions? What do you mean by slow? Can you share a snippet? On Tue, Jan 10, 2017 8:15 PM, Georg Heiler georg.kf.hei...@gmail.com wrote: Maybe you can create an UDF? Raghavendra Pandey wrote on Tue., 10 Jan 2017 at 20:04

Re: Nested ifs in sparksql

2017-01-10 Thread Georg Heiler
Maybe you can create a UDF? Raghavendra Pandey wrote on Tue., 10 Jan 2017 at 20:04: > I have around 41 levels of nested if-else in Spark SQL. I have > programmed it using APIs on dataframes. But it takes too much time. > Is there anything I can do to

Nested ifs in sparksql

2017-01-10 Thread Raghavendra Pandey
I have around 41 levels of nested if-else in Spark SQL. I have programmed it using APIs on dataframes, but it takes too much time. Is there anything I can do to improve the runtime here?
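
A sketch of the UDF route suggested above, which keeps the deep branching as plain Scala instead of 41 nested column expressions; the column name and thresholds here are made up:

    import org.apache.spark.sql.functions.{col, udf}

    // Plain Scala if/else compiles to ordinary bytecode instead of a deeply
    // nested Catalyst expression tree, which can avoid huge generated code.
    val classify = udf { (x: Int) =>
      if (x < 10) "low"
      else if (x < 100) "medium"
      else "high"                 // ...extend with the remaining branches
    }

    val result = df.withColumn("bucket", classify(col("x")))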

Re: Query in SparkSQL

2016-12-12 Thread vaquar khan
ing on SpqrkSQL using hiveContext (version 1.6.2). > > Can I run following queries directly in sparkSQL, if yes how > > > > update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id > from samples); > > > > insert into details (accnt_call_id, prdct_cd

Query in SparkSQL

2016-12-12 Thread Niraj Kumar
Hi, I am working on SparkSQL using hiveContext (version 1.6.2). Can I run the following queries directly in SparkSQL, and if yes, how? update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id from samples); insert into details (accnt_call_id, prdct_cd, prdct_id, dtl_pstn) select

SparkSQL

2016-12-09 Thread Niraj Kumar
Hi, I am working on SparkSQL using hiveContext (version 1.6.2). Can someone help me convert the following queries to SparkSQL? update calls set sample = 'Y' where accnt_call_id in (select accnt_call_id from samples); insert into details (accnt_call_id, prdct_cd, prdct_id, dtl_pstn) select

ClassCastException when using SparkSQL Window function

2016-11-17 Thread Isabelle Phan
Hello, I have a simple session table, which tracks pages users visited with a sessionId. I would like to apply a window function by sessionId, but am hitting a type cast exception. I am using Spark 1.5.0. Here is sample code: scala> df.printSchema root |-- sessionid: string (nullable = true)
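
For reference, a minimal window-by-session sketch using assumed column names; note that on Spark 1.5 window functions require a HiveContext and the function is rowNumber rather than row_number:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number

    val bySession = Window.partitionBy("sessionid").orderBy("ts")   // "ts" is an assumed column
    val ranked = df.withColumn("hit_number", row_number().over(bySession))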

SparkSQL: intra-SparkSQL-application table registration

2016-11-14 Thread Mohamed Nadjib Mami
lication without registering the tables that have already been registered before)." [1]: http://stackoverflow.com/questions/40549924/sparksql-intra-sparksql-application-table-registration Cheers, Mohamed

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
okay i see the partition local sort. got it. i would expect that pushing the partition local sort into the shuffle would give a significant boost. but that's just a guess. On Fri, Nov 4, 2016 at 2:39 PM, Michael Armbrust wrote: > sure, but then my values are not sorted per

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Michael Armbrust
> > sure, but then my values are not sorted per key, right? It does do a partition local sort. Look at the query plan in my example .

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
i just noticed Sort for Dataset has a global flag. and Dataset also has sortWithinPartitions. how about: repartition + sortWithinPartitions + mapPartitions? the plan looks ok, but it is not clear to me if the sort is done as part of the shuffle (which is the important optimization). scala> val
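
A sketch of that combination with placeholder column names; repartition co-locates each key in a single partition and sortWithinPartitions sorts only inside partitions, avoiding the global range exchange that orderBy implies:

    import org.apache.spark.sql.functions.col

    val partitionedSorted = df
      .repartition(col("key"))                              // hash partition by key
      .sortWithinPartitions(col("key"), col("secondary"))   // partition-local sort only

    // Each partition now holds complete key groups sorted by (key, secondary),
    // so mapPartitions can stream through them secondary-sort style.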

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-04 Thread Koert Kuipers
sure, but then my values are not sorted per key, right? so a group by key with values sorted according to to some ordering is an operation that can be done efficiently in a single shuffle without first figuring out range boundaries. and it is needed for quite a few algos, including Window and

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
Thinking out loud is good :) You are right in that anytime you ask for a global ordering from Spark you will pay the cost of figuring out the range boundaries for partitions. If you say orderBy, though, we aren't sure that you aren't expecting a global order. If you only want to make sure that

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i guess i could sort by (hashcode(key), key, secondarySortColumn) and then do mapPartitions? sorry thinking out loud a bit here. ok i think that could work. thanks On Thu, Nov 3, 2016 at 10:25 PM, Koert Kuipers wrote: > thats an interesting thought about orderBy and

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
thats an interesting thought about orderBy and mapPartitions. i guess i could emulate a groupBy with secondary sort using those two. however isn't using an orderBy expensive since it is a total sort? i mean a groupBy with secondary sort is also a total sort under the hood, but its on

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
> > It is still unclear to me why we should remember all these tricks (or add > lots of extra little functions) when this elegantly can be expressed in a > reduce operation with a simple one line lambda function. > I think you can do that too. KeyValueGroupedDataset has a reduceGroups function.

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Oh okay that makes sense. The trick is to take max on tuple2 so you carry the other column along. It is still unclear to me why we should remember all these tricks (or add lots of extra little functions) when this elegantly can be expressed in a reduce operation with a simple one line lambda

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Michael Armbrust
You are looking to perform an *argmax*, which you can do with a single aggregation. Here is an example . On Thu, Nov 3, 2016 at 4:53
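
The linked example did not survive in this archive; one common way to express an argmax in a single aggregation is max over a struct whose first field is the ordering column (key/ts/value are placeholder names):

    import org.apache.spark.sql.functions.{col, max, struct}

    // struct ordering is field-by-field, so max(struct(ts, value)) keeps the
    // value that belongs to the latest ts within each key.
    val latest = df
      .groupBy("key")
      .agg(max(struct(col("ts"), col("value"))).as("argmax"))
      .select(col("key"), col("argmax.value").as("latest_value"))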

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I agree with Koert. Relying on something because it appears to work when you test it can be dangerous if there is nothing in the api guarantee. Going back quite a few years it used to be the case that Oracle would always return a group by with the rows in the order of the grouping key. This

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
i did not check the claim in that blog post that the data is ordered, but i wouldn't rely on that behavior since it is not something the api guarantees and could change in future versions On Thu, Nov 3, 2016 at 9:59 AM, Rabin Banerjee wrote: > Hi Koert & Robin , >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread ayan guha
I would go for partition by option. It seems simple and yes, SQL inspired :) On 4 Nov 2016 00:59, "Rabin Banerjee" wrote: > Hi Koert & Robin , > > * Thanks ! *But if you go through the blog https://bzhangusc. >

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi Koert & Robin, Thanks! But if you go through the blog https://bzhangusc.wordpress.com/2015/05/28/groupby-on-dataframe-is-not-the-groupby-on-rdd/ and check the comments under the blog it's actually working, although I am not sure how. And yes I agree a custom aggregate UDAF is a good

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
Just realized you only want to keep first element. You can do this without sorting by doing something similar to min or max operation using a custom aggregator/udaf or reduceGroups on Dataset. This is also more efficient. On Nov 3, 2016 7:53 AM, "Rabin Banerjee"
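
A sketch of that idea on a typed Dataset, with a made-up case class standing in for the real schema (requires import spark.implicits._):

    case class Event(key: String, ts: Long, value: String)   // hypothetical schema

    // Pairwise reduction per key, like a min/max: keeps the earliest record
    // for each key without any sort or window.
    val firstPerKey = ds                                      // ds: Dataset[Event]
      .groupByKey(_.key)
      .reduceGroups((a, b) => if (a.ts <= b.ts) a else b)
      .map(_._2)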

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Koert Kuipers
What you require is secondary sort which is not available as such for a DataFrame. The Window operator is what comes closest but it is strangely limited in its abilities (probably because it was inspired by a SQL construct instead of a more generic programmatic transformation capability). On Nov

Re: Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Robin East
I don’t think the semantics of groupBy necessarily preserve ordering - whatever the implementation details or the observed behaviour. I would use a Window operation and order within the group. > On 3 Nov 2016, at 11:53, Rabin Banerjee wrote: > > Hi All , > >

Confusion SparkSQL DataFrame OrderBy followed by GroupBY

2016-11-03 Thread Rabin Banerjee
Hi All , I want to do a dataframe operation to find the rows having the latest timestamp in each group using the below operation

SparkSQL with Hive got "java.lang.NullPointerException"

2016-11-03 Thread lxw
Hi, experts: I use SparkSQL to query Hive tables; this query throws an NPE, but runs OK with Hive. SELECT city FROM ( SELECT city FROM t_ad_fact a WHERE a.pt = '2016-10-10' limit 100 ) x GROUP BY city; Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org

Re: LIMIT issue of SparkSQL

2016-10-29 Thread Asher Krim
We have also found LIMIT to take an unacceptable amount of time when reading parquet formatted data from s3. LIMIT was not strictly needed for our usecase, so we worked around it -- Asher Krim Senior Software Engineer On Fri, Oct 28, 2016 at 5:36 AM, Liz Bai wrote: > Sorry

Re: LIMIT issue of SparkSQL

2016-10-28 Thread Liz Bai
Sorry for the late reply. The size of the raw data is 20G and it is composed of two columns. We generated it by this . The test queries are very simple, 1). select ColA from Table limit 1 2). select ColA from Table

Re: Using Hive UDTF in SparkSQL

2016-10-27 Thread Davies Liu
Could you file a JIRA for this bug? On Thu, Oct 27, 2016 at 3:05 AM, Lokesh Yadav wrote: > Hello > > I am trying to use a Hive UDTF function in spark SQL. But somehow its not > working for me as intended and I am not able to understand the behavior. > > When I try to

Using Hive UDTF in SparkSQL

2016-10-27 Thread Lokesh Yadav
Hello, I am trying to use a Hive UDTF function in Spark SQL, but somehow it's not working for me as intended and I am not able to understand the behavior. When I try to register a function like this: create temporary function SampleUDTF_01 as 'com.fl.experiments.sparkHive.SampleUDTF' using JAR

Is there a length limit for sparksql/hivesql?

2016-10-26 Thread Jone Zhang
Is there a length limit for SparkSQL/HiveSQL? Can ANTLR work well if the SQL is too long? Thanks.

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Michael Armbrust
It is not about limits on specific tables. We do support that. The case I'm describing involves pushing limits across system boundaries. It is certainly possible to do this, but the current datasource API does not provide this information (other than the implicit limit that is pushed down to the

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Mich Talebzadeh
This is an interesting point. As far as I know in any database (practically all RDBMS Oracle, SAP etc), the LIMIT affects the collection part of the result set. The result set is carried out fully on the query that may involve multiple joins on multiple underlying tables. To limit the actual

Re: LIMIT issue of SparkSQL

2016-10-23 Thread Michael Armbrust
- dev + user Can you give more info about the query? Maybe a full explain()? Are you using a datasource like JDBC? The API does not currently push down limits, but the documentation talks about how you can use a query instead of a table if that is what you are looking to do. On Mon, Oct 24,
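
A sketch of the documented workaround for JDBC sources, passing a query in place of a table name so the limit is evaluated in the source database (URL, credentials and names are placeholders):

    val limited = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")               // placeholder URL
      .option("dbtable", "(SELECT colA FROM mytable LIMIT 1000) AS t")   // limit runs in the database
      .option("user", "dbuser")
      .option("password", "dbpass")
      .load()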

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi Mich, Even i'm getting similar output. The dates that are passed as input are different from the one in the output. Since its an inner join, the expected result is [2015-12-31,2015-12-31,1,105] [2016-01-27,2016-01-27,5,101] Thanks & Regds, --Praseetha On Tue, Sep 13, 2016 at 11:21 PM, Mich

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Hi Praseetha, This is how I have written this. case class TestDate (id: String, loginTime: java.sql.Date) val formate = new SimpleDateFormat("-MM-DD") val TestDateData = sc.parallelize(List( ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)), ("2", new

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Hi Praseetha. :32: error: not found: value formate Error occurred in an application involving default arguments. ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)), What is that formate? Thanks Dr Mich Talebzadeh LinkedIn *

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi Mich, Thanks a lot for your reply. Here is the sample case class TestDate (id: String, loginTime: java.sql.Date) val formate = new SimpleDateFormat("-MM-DD") val TestDateData = sc.parallelize(List( ("1", new java.sql.Date(formate.parse("2016-01-31").getTime)),
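
One thing worth checking in the snippet above, offered as a guess rather than a confirmed diagnosis: in SimpleDateFormat, DD is day-of-year while dd is day-of-month, so a pattern other than "yyyy-MM-dd" can silently shift the parsed dates, and the dates that come out of the join no longer match the strings that went in. A corrected sketch:

    import java.text.SimpleDateFormat

    case class TestDate(id: String, loginTime: java.sql.Date)

    val formatter = new SimpleDateFormat("yyyy-MM-dd")        // dd = day-of-month, not DD
    val testDateData = sc.parallelize(List(
      TestDate("1", new java.sql.Date(formatter.parse("2016-01-31").getTime)),
      TestDate("2", new java.sql.Date(formatter.parse("2015-12-31").getTime))
    ))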

Re: Unable to compare SparkSQL Date columns

2016-09-13 Thread Mich Talebzadeh
Can you send the rdds that just creates those two dates? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Unable to compare SparkSQL Date columns

2016-09-13 Thread Praseetha
Hi All, I have a case class in scala case class TestDate (id: String, loginTime: java.sql.Date) I created 2 RDD's of type TestDate I wanted to do an inner join on two rdd's where the values of loginTime column is equal. Please find the code snippet below,

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Mich Talebzadeh
Right, let us simplify this. Can you run the whole thing *once* only and send the DAG execution output from the UI? You can use a snipping tool to take the image. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Rabin Banerjee
Hi, 1. You are doing some analytics I guess? *YES* 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of sql? *I am not looping any SQL; all SQLs are called exactly once, and each requires output from the previous SQL.* 3. Your

Re: SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-10 Thread Mich Talebzadeh
Hi 1. You are doing some analytics I guess? 2. It is almost impossible to guess what is happening except that you are looping 50 times over the same set of sql? 3. Your sql step n depends on step n-1. So spark cannot get rid of 1 -n steps 4. you are not storing anything in

SparkSQL DAG generation , DAG optimization , DAG execution

2016-09-09 Thread Rabin Banerjee
Hi All, I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally giving GC overhead errors. Here is the program: 1. Read 3 files (one medium-sized and 2 small files) and register them as DFs. 2. Fire SQL with complex aggregation and

Re: [SparkSQL+SparkStreaming] SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001
the data can be written as parquet into HDFS. But the loading data process is not working as expected. Thanks and best regards! San.Luo ----- Original Message ----- From: <luohui20...@sina.com> To: "user" <user@spark.apache.org> Subject: [SparkSQL+SparkStream

[SparkSQL+SparkStreaming]SparkStreaming APP can not load data into SparkSQL table

2016-09-05 Thread luohui20001
Hi guys: I have a problem where my SparkStreaming app cannot load data into a SparkSQL table. Here is my code: val conf = new SparkConf().setAppName("KafkaStreaming for " + topics).setMaster("spark://master60:7077") val storageLevel = StorageLevel.DISK_ON

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread ayan guha
on to everyone. >>> >>> >>> >>> We are looking forward to: >>> >>> 1) A *solution or a work around, by which we can give secure >>> access only to the selected users to sensitive tables/database.* >>> >>> 2)

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Mich Talebzadeh
e secure access >> only to the selected users to sensitive tables/database.* >> >> 2) *Failing to do so, we would like to remove/disable the SparkSQL >> context/feature for everyone. * >> >> >> >> Any pointers in this direction will be very valuable.

Re: Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Deepak Sharma
tion as we do not want to give blanket > permission to everyone. > > > > We are looking forward to: > > 1) A *solution or a work around, by which we can give secure access > only to the selected users to sensitive tables/database.* > > 2) *Failing to do so, we

Controlling access to hive/db-tables while using SparkSQL

2016-08-30 Thread Rajani, Arpan
to remove/disable the SparkSQL context/feature for everyone. Any pointers in this direction will be very valuable. Thank you, Arpan

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
A good way is to implement your own data source to load data of matrix format. You can refer to the LibSVM data source (https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm), which contains one column of vector type, which is very similar to a matrix.

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
ch are actually not used > by > > the UDF. Maybe the UDF serialization to Python serializes the whole row > > instead of just the attributes of the UDF? > > > > On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> > wrote: > >> > >

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Davies Liu
t; instead of just the attributes of the UDF? > > On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote: >> >> On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> >> wrote: >> > Hi all, >> > >> > I have

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-09 Thread Zoltan Fedor
tributes of the UDF? On Mon, Aug 8, 2016 at 5:59 PM, Davies Liu <dav...@databricks.com> wrote: > On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> > wrote: > > Hi all, > > > > I have an interesting issue trying to use UDFs from SparkSQL in

Re: saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
; Subject: saving DF to HDFS in parquet format very slow in SparkSQL app Date: 2016-08-09 15:34 Hi there: I have a problem where saving a DF to HDFS as parquet is very slow. I attached a pic which shows that a lot of time is spent in getting the result. The code is: streamingData.write.mode(SaveMode.Ove

saving DF to HDFS in parquet format very slow in SparkSQL app

2016-08-09 Thread luohui20001
Hi there: I have a problem where saving a DF to HDFS in parquet format is very slow. I attached a pic which shows that a lot of time is spent in getting the result. The code is: streamingData.write.mode(SaveMode.Overwrite).parquet("/data/streamingData") I don't quite understand why my app is so slow in

Re: java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Davies Liu
On Mon, Aug 8, 2016 at 2:24 PM, Zoltan Fedor <zoltan.1.fe...@gmail.com> wrote: > Hi all, > > I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 > using pyspark. > > There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300 > exe

java.lang.OutOfMemoryError: GC overhead limit exceeded when using UDFs in SparkSQL (Spark 2.0.0)

2016-08-08 Thread Zoltan Fedor
Hi all, I have an interesting issue trying to use UDFs from SparkSQL in Spark 2.0.0 using pyspark. There is a big table (5.6 Billion rows, 450Gb in memory) loaded into 300 executors' memory in SparkSQL, on which we would do some calculations using UDFs in pyspark. If I run my SQL on only

using matrix as column datatype in SparkSQL Dataframe

2016-08-08 Thread Vadla, Karthik
Hello all, I'm trying to load a set of medical images (DICOM) into a Spark SQL dataframe. Here each image is loaded into a matrix column of the dataframe. I see Spark recently added MatrixUDT to support this kind of case, but I don't find a sample for using a matrix as a column in a dataframe.

How to avoid sql injection on SparkSQL?

2016-08-04 Thread Linyuxin
Hi All, I want to know how to avoid SQL injection on SparkSQL. Is there any common pattern for this? E.g. some useful tool or code segment, or should I just create a “wheel” on SparkSQL myself? Thanks.
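
There is no built-in sanitizer, but a common pattern is to keep untrusted input out of the SQL string entirely and bind it through the Column API, where it is treated as a value rather than as SQL text. A minimal sketch with placeholder names:

    import org.apache.spark.sql.functions.{col, lit}

    val userInput: String = readUntrustedInput()   // hypothetical source of untrusted input

    // Risky: string concatenation lets userInput inject SQL text.
    // spark.sql(s"SELECT * FROM users WHERE name = '$userInput'")

    // Safer: userInput is bound as a literal value, never parsed as SQL.
    val result = spark.table("users").filter(col("name") === lit(userInput))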

Re: SPARKSQL with HiveContext My job fails

2016-08-04 Thread Mich Talebzadeh
Well the error states Exception in thread thread_name: java.lang.OutOfMemoryError: GC Overhead limit exceeded Cause: The detail message "GC overhead limit exceeded" indicates that the garbage collector is

SPARKSQL with HiveContext My job fails

2016-08-04 Thread Vasu Devan
Hi Team, My Spark job fails with the below error. Could you please advise me what the problem is with my job? Below is my error stack: 16/08/04 05:11:06 ERROR ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.actor.default-dispatcher-14] shutting down ActorSystem [sparkDriver]

Re: Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Sonal Goyal
xin <linyu...@huawei.com> wrote: > Hi ALL > > Is there any reference of performance tuning on SparkSQL? > > I can only find about turning on spark core on http://spark.apache.org/ >

Any reference of performance tuning on SparkSQL?

2016-07-28 Thread Linyuxin
Hi All, Is there any reference for performance tuning of SparkSQL? I can only find tuning material for Spark core on http://spark.apache.org/

WrappedArray in SparkSQL DF

2016-07-22 Thread KhajaAsmath Mohammed
Hi, I am reading a JSON file and I am facing difficulties trying to get individual elements of this array. Does anyone know how to get the elements from WrappedArray(WrappedArray(String))? Schema: ++ |rows| ++ |[WrappedArray(Bon...|

Re: Where is the SparkSQL Specification?

2016-07-21 Thread Mich Talebzadeh
Spark SQL is a subset of Hive SQL which by and large supports ANSI 92 SQL including search parameters like above scala> sqlContext.sql("select count(1) from oraclehadoop.channels where channel_desc like ' %b_xx%'").show +---+ |_c0| +---+ | 0| +---+ So check Hive QL Language support HTH Dr

Where is the SparkSQL Specification?

2016-07-21 Thread Linyuxin
Hi All, Newbie here. My Spark version is 1.5.1, and I want to know where I can find the specification of Spark SQL, to find out whether syntax like ‘a like %b_xx’ or other SQL syntax is supported

Re: How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-12 Thread Daniel Darabos
there any other way to register UDFs in sparkSQL so that they remain > persistent? > > Regards > Lokesh >

How to Register Permanent User-Defined-Functions (UDFs) in SparkSQL

2016-07-10 Thread Lokesh Yadav
other way to register UDFs in sparkSQL so that they remain persistent? Regards Lokesh

SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread linxi zeng
Hi, all: As recorded in https://issues.apache.org/jira/browse/SPARK-16408, when using Spark-sql to execute sql like: add file hdfs://xxx/user/test; If the HDFS path( hdfs://xxx/user/test) is a directory, then we will get an exception like: org.apache.spark.SparkException: Added file

Re: SparkSQL issue: Spark 1.3.1 + hadoop 2.6 on CDH5.3 with parquet

2016-06-20 Thread Satya
Hello, We are also experiencing the same error. Can you please provide the steps that resolved the issue. Thanks Satya -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-issue-Spark-1-3-1-hadoop-2-6-on-CDH5-3-with-parquet-tp22808p27197.html Sent from

SparkSql Catalyst extending Analyzer, Error with CatalystConf

2016-05-13 Thread sib
d passing it as SQLConf but get a not found error and importing doesn't seem to work. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSql-Catalyst-extending-Analyzer-Error-with-CatalystConf-tp26950.html Sent from the Apache Spark U
