Re: Column names returned as rows and other jdbc weirdness --Solved

Alun ap Rhisiart Mon, 19 Apr 2021 03:09:19 -0700

Solved. It was indeed a driver issue. As per 
https://docs.databricks.com/data/data-sources/sql-databases.html 
<https://docs.databricks.com/data/data-sources/sql-databases.html> and 
https://stackoverflow.com/questions/52718788/how-to-read-data-from-mariadb-using-spark-java/53214165#53214165
 
<https://stackoverflow.com/questions/52718788/how-to-read-data-from-mariadb-using-spark-java/53214165#53214165>
 the url string needs to be ‘jdbc:mysql…’ not ‘jdbc:mariadb’, which is what our 
devops had set up. It reads just fine now.


Regards,

Alun ap Rhisiart

> On 16 Apr 2021, at 18:28, Alun ap Rhisiart <cod...@alunaprhisiart.uk> wrote:
> 
> Thanks, Mich, 
> 
> Yes, I’m coming to the conclusion that there is a driver issue. Our ops guy 
> is back next Monday, so I’ll see what he can do. I tried things like adding a 
> WHERE clause so that I was sure that the row received had no nulls, in case 
> that was the problem, but it was exactly the same.
> 
> Regards,
> 
> Alun
> 
> 
>> On 16 Apr 2021, at 17:05, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> thanks for the info Alun.
>> 
>> From my experience these issues of messed up columns through JDBC are 
>> usually related to the JDBC driver being used.
>> 
>> Your database is MariaDB. I have no direct experience of this database but 
>> it is akin to MySQL. Case in point, I had all sorts of issues connecting to 
>> Hive through JDBC connection from Cloud to on-premise after trying four 
>> different drivers., only hive_driver: com.cloudera.hive.jdbc41.HS2Driver 
>> worked
>> 
>> Can you try a few other drivers that support SSL for MariaDB from different 
>> vendors?
>> 
>> Powerful MySQL JDBC Driver Download | Progress DataDirect 
>> <https://www.progress.com/jdbc>
>> 
>> I guess MySql driver may work. Just try these drivers.
>> 
>> You can either add the jar file to $SPARK_HOME/jars or to 
>> $SPARK_HOME/conf/spark-defaults.conf
>> 
>> spark.driver.extraClassPath        /data6/hduser/mariaDB_specific.jar
>> 
>> 
>> HTH
>> 
>> 
>> 
>>    view my Linkedin profile 
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction. 
>>  
>> 
>> 
>> On Fri, 16 Apr 2021 at 16:34, Alun ap Rhisiart <cod...@alunaprhisiart.uk 
>> <mailto:cod...@alunaprhisiart.uk>> wrote:
>> Many thanks for the reply, Mich. I’m running Databricks on Azure. Databricks 
>> runtime version:8.1 ML (includes Apache Spark 3.1.1, Scala 2.12)
>> 
>> The UUID columns,  I believe, comes from Azure IoT. It is generally 36 
>> characters, like '934c1f58-ed11-4e48-b157-aab869d9b325’, although I note 
>> some are shorter, possibly test data. The column is defined as VARCHAR(255).
>> 
>> If I run the SQL outside of Spark I get exactly what I expect, the uuids in 
>> the first column, gender as ‘M’, ‘F’ or whatever in the second, and zeroes 
>> and nulls in the other columns. That is the puzzling part: that every 
>> variation of SQL I have tried works perfectly well in itself (in DataGrip 
>> eg), but returns junk when run in Spark. For the last four columns, and 
>> possibly the second, I could understand the nulls confusing it, but the bit 
>> I really don’t understand is why it says ‘uuid’ and ‘gender’, ie the column 
>> names, for all the rows of the first two columns.
>> 
>> As for just downloading the tables and doing the join in spark data frames, 
>> I tried that but I hit the other issue I mention: 
>> 
>> SQLException: Out of range value for column 'id' : value id is not in Long 
>> range
>> 
>> Even though ‘SELECT max(id) from devices’ returns 16091, and there are no 
>> nulls. Which is why I was doing the join in the DB. The IDs, as I mentioned, 
>> are all BigInt(20).
>> 
>> 
>>> On 16 Apr 2021, at 15:53, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Which version of Spark are you using?
>>> 
>>> UUID is generated by DB through OS low level call and it is 36 characters
>>> 
>>> UUID=$(uuidgen)
>>> echo $UUID
>>> ef080790-4c3f-4a5f-8db7-1024338d34f2
>>> 
>>> 
>>> in other words string will do it or VARCHAR(36)
>>> 
>>> When you run that SQL directly on the database itself what do you get?
>>> 
>>> The alternative is to make two calls directly via JDBC to the underlying 
>>> database, get the data back into DF from those two tables and do the join 
>>> in Pyspark itself as a test
>>> 
>>> Spark connection tyo and DB which allows JDBC is generic
>>> 
>>> def loadTableFromJDBC(spark, url, tableName, user, password, driver, 
>>> fetchsize):
>>>     try:
>>>        df = spark.read. \
>>>             format("jdbc"). \
>>>             option("url", url). \
>>>             option("dbtable", tableName). \
>>>             option("user", user). \
>>>             option("password", password). \
>>>             option("driver", driver). \
>>>             option("fetchsize", fetchsize). \
>>>             load()
>>>        return df
>>>     except Exception as e:
>>>         print(f"""{e}, quitting""")
>>>         sys.exit(1)
>>> 
>>> HTH
>>> 
>>> 
>>> 
>>>    view my Linkedin profile 
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction. 
>>>  
>>> 
>>> 
>>> On Fri, 16 Apr 2021 at 12:21, Alun ap Rhisiart <cod...@alunaprhisiart.uk 
>>> <mailto:cod...@alunaprhisiart.uk>> wrote:
>>> I’m just starting using PySpark (Databricks) for a education application. 
>>> Part of this is monitoring children’s online behaviour to alert teachers 
>>> whether there may be problems with bullying, extreme diets, suicide 
>>> ideation, and so on. I have IoT data which I need to combine with 
>>> information from MariaDB (this is all in Azure). I have SparkJDBC42 and 
>>> mariadb_java_client_2_7_2 jars installed. The connection to the database is 
>>> established, in that I can see it can retrieve the schema for tables.
>>> 
>>> I have a couple of issues. The first is that I can never retrieve any id 
>>> columns (which are all defined as BigInt(20)), as I get a ‘long out of 
>>> range’ error. I’m currently working around that by not including the ids 
>>> themselves in the return. However, the big problem is that I get the column 
>>> names returned in each row instead of the values for each row, where the 
>>> columns are defined as strings (VARCHAR etc). Also, for columns defined as 
>>> TinyInt they are returned as booleans, but reversed (0 is returned as 
>>> True). I have tried running  the SQL outside of databricks/Spark (eg in 
>>> DataGrip) and it returns perfectly sensible data every time.
>>> 
>>> The code at gist:412e1f3324136a574303005a0922f610 
>>> <https://gist.github.com/alunap/412e1f3324136a574303005a0922f610>
>>> 
>>> 
>>> Returned:
>>> +----+------+----+-----------+----+--------------+
>>> |uuid|gender| cpp|young_carer| spp|asylum_refugee|
>>> +----+------+----+-----------+----+--------------+
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> |uuid|gender|true|       true|true|          true|
>>> +----+------+----+-----------+----+--------------+
>>> only showing top 10 rows
>>> 
>>> On the database, device.uuid field is VARCHAR(255) and contains valid uuids 
>>> (no nulls). 
>>> children.gender is VARCHAR(255) and contains ‘M’, ‘F’, ‘MALE’, ‘FEMALE’, 
>>> ‘NONE’, or null.
>>> children.cpp, young_carer, spp, and asylum_refugee are all tinyint(1) = 0. 
>>> They are nearly all 0, but the first 10 rows contain some nulls.
>>> 
>>> I tried enclosing the query with brackets ‘(SELECT…) t’ as I gather it is a 
>>> subquery, and I tried adding a WHERE d.uuid = ‘an id’ with an id being one 
>>> where there are no nulls in the column, but no difference. So, completely 
>>> baffled at this point.
>>> 
>>> Thanks for any suggestions,
>>> 
>>> Alun ap Rhisiart
>> 
>> Alun ap Rhisiart
>> 
>> attached you'll find my public key
>> 
>> You can use this key to encrypt and secure our messages.
>> 
>> To start using it, you'll need to install an OpenPGP software on your 
>> computer.  Below you'll find a list of possible solutions for your operating 
>> system:
>> 
>> macOS        
>> https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail
>>  
>> <https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail>
>> Linux        https://ssd.eff.org/en/module/how-use-pgp-linux 
>> <https://ssd.eff.org/en/module/how-use-pgp-linux>
>> Windows      https://ssd.eff.org/en/module/how-use-pgp-windows-pc 
>> <https://ssd.eff.org/en/module/how-use-pgp-windows-pc>
>> iOS          https://itunes.apple.com/app/ipgmail/id430780873?mt=8 
>> <https://itunes.apple.com/app/ipgmail/id430780873?mt=8>
>> Android      
>> https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain
>>  
>> <https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain>
>> 
>> Please import the public key into your local OpenPGP Key-Manager.
>> 
>> 
>> 
>> 
> 
> Alun ap Rhisiart
> 
> attached you'll find my public key
> 
> You can use this key to encrypt and secure our messages.
> 
> To start using it, you'll need to install an OpenPGP software on your 
> computer.  Below you'll find a list of possible solutions for your operating 
> system:
> 
> macOS 
> https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail
>  
> <https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail>
> Linux https://ssd.eff.org/en/module/how-use-pgp-linux 
> <https://ssd.eff.org/en/module/how-use-pgp-linux>
> Windows       https://ssd.eff.org/en/module/how-use-pgp-windows-pc 
> <https://ssd.eff.org/en/module/how-use-pgp-windows-pc>
> iOS           https://itunes.apple.com/app/ipgmail/id430780873?mt=8 
> <https://itunes.apple.com/app/ipgmail/id430780873?mt=8>
> Android       
> https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain 
> <https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain>
> 
> Please import the public key into your local OpenPGP Key-Manager.
> 
> 
> 
> <Alun ap Rhisiart (0EEFB724) – Public.asc>

Re: Column names returned as rows and other jdbc weirdness --Solved

Reply via email to