This looks like a bug in the code that generates the SQL query. Why it would be specific to SAS, I can't guess. Did you try the same thing against another database? As a workaround, you can write the SELECT statement yourself instead of just providing the table name.
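For example, a minimal sketch of that workaround in Scala, assuming the Spark 1.x JDBC data source. The URL, library name, and credentials below are placeholders, and depending on the Spark version the outer projection Spark generates may still quote column names, so it is worth verifying against the driver log:

    // Hand Spark a ready-made SELECT via the "dbtable" option instead of a
    // bare table name. A subquery must be parenthesised and aliased.
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object SasJdbcWorkaround {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SasJdbcWorkaround"))
        val sqlContext = new HiveContext(sc)

        val df = sqlContext.read.format("jdbc").options(Map(
          "driver"   -> "com.sas.rio.MVADriver",
          "url"      -> "jdbc:sharenet://sas-host:8591",  // placeholder URL
          "dbtable"  -> "(SELECT a.sr_no, a.start_dt, a.end_dt FROM sasLib.run_control a) t",
          "user"     -> "xxxx",
          "password" -> "xxxx")).load()

        df.show()
      }
    }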
> On Jun 11, 2016, at 6:27 PM, Ajay Chander <itsche...@gmail.com> wrote:
>
> I tried implementing the same functionality through Scala as well, but no luck so far. Just wondering if anyone here has tried using Spark SQL to read a SAS dataset? Thank you.
>
> Regards,
> Ajay
>
> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
> Mich, I completely agree with you. I built another Spark SQL application which reads data from MySQL and SQL Server and writes the data into Hive (Parquet + Snappy format). I have this problem only when I read directly from the remote SAS system. The interesting part is that I am using the same driver to read data through a pure Java app and the Spark app. It works fine in the Java app, so I cannot blame the SAS driver here. Trying to understand where the problem could be. Thanks for sharing this with me.
>
> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> I personally use Scala to do something similar. For example, here I extract data from an Oracle table and store it in an ORC table in Hive. This is compiled via sbt and run with spark-submit.
>
> It is similar to your code, but in Scala. Note that I do not enclose my column names in double quotes.
>
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
>
> object ETL_scratchpad_dummy {
>   def main(args: Array[String]) {
>     val conf = new SparkConf().
>       setAppName("ETL_scratchpad_dummy").
>       set("spark.driver.allowMultipleContexts", "true")
>     val sc = new SparkContext(conf)
>     // Create sqlContext based on HiveContext
>     val sqlContext = new HiveContext(sc)
>     import sqlContext.implicits._
>     val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>     println("\nStarted at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     HiveContext.sql("use oraclehadoop")
>
>     var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
>     var _username : String = "scratchpad"
>     var _password : String = "xxxxxxxx"
>
>     // Get data from Oracle table scratchpad.dummy
>     val d = HiveContext.load("jdbc",
>       Map("url" -> _ORACLEserver,
>           "dbtable" -> "(SELECT to_char(ID) AS ID, to_char(CLUSTERED) AS CLUSTERED, to_char(SCATTERED) AS SCATTERED, to_char(RANDOMISED) AS RANDOMISED, RANDOM_STRING, SMALL_VC, PADDING FROM scratchpad.dummy)",
>           "user" -> _username,
>           "password" -> _password))
>
>     d.registerTempTable("tmp")
>     //
>     // Need to create and populate target ORC table oraclehadoop.dummy
>     //
>     HiveContext.sql("use oraclehadoop")
>     //
>     // Drop and create table dummy
>     //
>     HiveContext.sql("DROP TABLE IF EXISTS oraclehadoop.dummy")
>     var sqltext : String = ""
>     sqltext = """
>     CREATE TABLE oraclehadoop.dummy (
>        ID INT
>      , CLUSTERED INT
>      , SCATTERED INT
>      , RANDOMISED INT
>      , RANDOM_STRING VARCHAR(50)
>      , SMALL_VC VARCHAR(10)
>      , PADDING VARCHAR(10)
>     )
>     CLUSTERED BY (ID) INTO 256 BUCKETS
>     STORED AS ORC
>     TBLPROPERTIES (
>       "orc.create.index"="true",
>       "orc.bloom.filter.columns"="ID",
>       "orc.bloom.filter.fpp"="0.05",
>       "orc.compress"="SNAPPY",
>       "orc.stripe.size"="16777216",
>       "orc.row.index.stride"="10000" )
>     """
>     HiveContext.sql(sqltext)
>     //
>     // Put data in Hive table. Clean up is already done
>     //
>     sqltext = """
>     INSERT INTO TABLE oraclehadoop.dummy
>     SELECT
>        ID
>      , CLUSTERED
>      , SCATTERED
>      , RANDOMISED
>      , RANDOM_STRING
>      , SMALL_VC
>      , PADDING
>     FROM tmp
>     """
>     HiveContext.sql(sqltext)
>     println("\nFinished at")
>     sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect.foreach(println)
>     sys.exit()
>   }
> }
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
> On 10 June 2016 at 23:38, Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Mich,
>
> Thanks for the response. If you look at my programs, I am not writing my queries to include column names in a pair of "". The driver in my Spark program is generating such a query, with column names in "", which I do not want. On the other hand, I am using the same driver in my pure Java program (attached); in that program the same driver generates a proper SQL query without "".
>
> Pure Java log:
>
> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>
> Spark SQL log:
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>
> Please find the complete program and full logs attached in the thread below. Thank you.
>
> Regards,
> Ajay
>
> On Friday, June 10, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Assuming I understood your query: in Spark SQL (that is, when you log in to Spark SQL with something like spark-sql --master spark://<HOST_NAME>:7077) you do not need double quotes around column names for the SQL to work.
>
> spark-sql> select "hello from Mich" from oraclehadoop.sales limit 1;
> hello from Mich
>
> Anything between a pair of "" will be interpreted as text, NOT a column name.
>
> In Spark SQL you do not need double quotes. So simply:
>
> spark-sql> select prod_id, cust_id from sales limit 2;
> 17 28017
> 18 10419
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
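(A quick illustration of Mich's point, assuming a Spark 1.x HiveContext: double quotes produce string literals, while backticks are how you quote an identifier if you ever need to.)

    // Double quotes yield a string literal, not a column reference:
    sqlContext.sql("""SELECT "prod_id" FROM sales LIMIT 1""").show()
    // -> returns the literal text prod_id

    // Backticks quote identifiers:
    sqlContext.sql("SELECT `prod_id`, `cust_id` FROM sales LIMIT 2").show()
    // -> returns the actual column values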
> On 10 June 2016 at 21:54, Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi again, has anyone in this group tried to access a SAS dataset through Spark SQL? Thank you.
>
> Regards,
> Ajay
>
> On Friday, June 10, 2016, Ajay Chander <itsche...@gmail.com> wrote:
>
> Hi Spark Users,
>
> I hope everyone here is doing great.
>
> I am trying to read data from SAS through Spark SQL and write it into HDFS. Initially I started with a pure Java program; please find the program and logs in the attached file sas_pure_java.txt. My program ran successfully and returned the data from SAS to Spark SQL. Please note the highlighted part in the log.
>
> My SAS dataset has 4 rows. The program ran successfully, so my output is:
>
> [2016-06-10 10:35:21,584] INFO stmt(1.1)#executeQuery SELECT a.sr_no,a.start_dt,a.end_dt FROM sasLib.run_control a; created result set 1.1.1; time= 0.122 secs (com.sas.rio.MVAStatement:590)
>
> [2016-06-10 10:35:21,630] INFO rs(1.1.1)#next (first call to next); time= 0.045 secs (com.sas.rio.MVAResultSet:773)
>
> 1,'2016-01-01','2016-01-31'
> 2,'2016-02-01','2016-02-29'
> 3,'2016-03-01','2016-03-31'
> 4,'2016-04-01','2016-04-30'
>
> Please find the full logs attached to this email in the file sas_pure_java.txt.
>
> _______________________
>
> Now I am trying to do the same via Spark SQL. Please find my program and logs attached to this email in the file sas_spark_sql.txt.
>
> The connection to the SAS dataset is established successfully, but please note the highlighted log below.
>
> [2016-06-10 10:29:05,834] INFO conn(2)#prepareStatement sql=SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; prepared statement 2.1; time= 0.038 secs (com.sas.rio.MVAConnection:538)
>
> [2016-06-10 10:29:05,935] INFO ps(2.1)#executeQuery SELECT "SR_NO","start_dt","end_dt" FROM sasLib.run_control ; created result set 2.1.1; time= 0.102 secs (com.sas.rio.MVAStatement:590)
>
> Please find the full logs attached to this email in the file sas_spark_sql.txt.
>
> I am using the same driver in both the pure Java and Spark SQL programs, but the query generated in Spark SQL has quotes around the column names (highlighted above). So my resulting output for that query looks like this:
>
> +-----+--------+------+
> |  _c0|     _c1|   _c2|
> +-----+--------+------+
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> |SR_NO|start_dt|end_dt|
> +-----+--------+------+
>
> Since both programs use the same driver, com.sas.rio.MVADriver, the expected output should be the same as the pure Java program's output, but something else is happening behind the scenes.
>
> Any insights on this issue? Thanks for your time.
>
> Regards,
> Ajay
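One avenue worth investigating, not mentioned in the thread: Spark routes identifier quoting for JDBC sources through org.apache.spark.sql.jdbc.JdbcDialect, whose default quoteIdentifier wraps column names in double quotes, which matches the quoting seen in the Spark SQL log above. Registering a custom dialect that leaves identifiers bare might make the generated SELECT match what the pure Java program sends. A sketch, with the URL prefix in canHandle being an assumption to adjust for the actual SAS JDBC URL:

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    // Hypothetical dialect for the SAS driver: match on the JDBC URL and
    // return column names unquoted instead of wrapped in double quotes.
    object SasDialect extends JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:sharenet")  // adjust to the real URL prefix
      override def quoteIdentifier(colName: String): String = colName
    }

    // Register the dialect before loading the JDBC DataFrame.
    JdbcDialects.registerDialect(SasDialect)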