Solved. It was indeed a driver issue. As per https://docs.databricks.com/data/data-sources/sql-databases.html <https://docs.databricks.com/data/data-sources/sql-databases.html> and https://stackoverflow.com/questions/52718788/how-to-read-data-from-mariadb-using-spark-java/53214165#53214165 <https://stackoverflow.com/questions/52718788/how-to-read-data-from-mariadb-using-spark-java/53214165#53214165> the url string needs to be ‘jdbc:mysql…’ not ‘jdbc:mariadb’, which is what our devops had set up. It reads just fine now.
Regards, Alun ap Rhisiart > On 16 Apr 2021, at 18:28, Alun ap Rhisiart <cod...@alunaprhisiart.uk> wrote: > > Thanks, Mich, > > Yes, I’m coming to the conclusion that there is a driver issue. Our ops guy > is back next Monday, so I’ll see what he can do. I tried things like adding a > WHERE clause so that I was sure that the row received had no nulls, in case > that was the problem, but it was exactly the same. > > Regards, > > Alun > > >> On 16 Apr 2021, at 17:05, Mich Talebzadeh <mich.talebza...@gmail.com >> <mailto:mich.talebza...@gmail.com>> wrote: >> >> thanks for the info Alun. >> >> From my experience these issues of messed up columns through JDBC are >> usually related to the JDBC driver being used. >> >> Your database is MariaDB. I have no direct experience of this database but >> it is akin to MySQL. Case in point, I had all sorts of issues connecting to >> Hive through JDBC connection from Cloud to on-premise after trying four >> different drivers., only hive_driver: com.cloudera.hive.jdbc41.HS2Driver >> worked >> >> Can you try a few other drivers that support SSL for MariaDB from different >> vendors? >> >> Powerful MySQL JDBC Driver Download | Progress DataDirect >> <https://www.progress.com/jdbc> >> >> I guess MySql driver may work. Just try these drivers. >> >> You can either add the jar file to $SPARK_HOME/jars or to >> $SPARK_HOME/conf/spark-defaults.conf >> >> spark.driver.extraClassPath /data6/hduser/mariaDB_specific.jar >> >> >> HTH >> >> >> >> view my Linkedin profile >> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >> >> Disclaimer: Use it at your own risk. Any and all responsibility for any >> loss, damage or destruction of data or any other property which may arise >> from relying on this email's technical content is explicitly disclaimed. The >> author will in no case be liable for any monetary damages arising from such >> loss, damage or destruction. >> >> >> >> On Fri, 16 Apr 2021 at 16:34, Alun ap Rhisiart <cod...@alunaprhisiart.uk >> <mailto:cod...@alunaprhisiart.uk>> wrote: >> Many thanks for the reply, Mich. I’m running Databricks on Azure. Databricks >> runtime version:8.1 ML (includes Apache Spark 3.1.1, Scala 2.12) >> >> The UUID columns, I believe, comes from Azure IoT. It is generally 36 >> characters, like '934c1f58-ed11-4e48-b157-aab869d9b325’, although I note >> some are shorter, possibly test data. The column is defined as VARCHAR(255). >> >> If I run the SQL outside of Spark I get exactly what I expect, the uuids in >> the first column, gender as ‘M’, ‘F’ or whatever in the second, and zeroes >> and nulls in the other columns. That is the puzzling part: that every >> variation of SQL I have tried works perfectly well in itself (in DataGrip >> eg), but returns junk when run in Spark. For the last four columns, and >> possibly the second, I could understand the nulls confusing it, but the bit >> I really don’t understand is why it says ‘uuid’ and ‘gender’, ie the column >> names, for all the rows of the first two columns. >> >> As for just downloading the tables and doing the join in spark data frames, >> I tried that but I hit the other issue I mention: >> >> SQLException: Out of range value for column 'id' : value id is not in Long >> range >> >> Even though ‘SELECT max(id) from devices’ returns 16091, and there are no >> nulls. Which is why I was doing the join in the DB. The IDs, as I mentioned, >> are all BigInt(20). >> >> >>> On 16 Apr 2021, at 15:53, Mich Talebzadeh <mich.talebza...@gmail.com >>> <mailto:mich.talebza...@gmail.com>> wrote: >>> >>> Hi, >>> >>> Which version of Spark are you using? >>> >>> UUID is generated by DB through OS low level call and it is 36 characters >>> >>> UUID=$(uuidgen) >>> echo $UUID >>> ef080790-4c3f-4a5f-8db7-1024338d34f2 >>> >>> >>> in other words string will do it or VARCHAR(36) >>> >>> When you run that SQL directly on the database itself what do you get? >>> >>> The alternative is to make two calls directly via JDBC to the underlying >>> database, get the data back into DF from those two tables and do the join >>> in Pyspark itself as a test >>> >>> Spark connection tyo and DB which allows JDBC is generic >>> >>> def loadTableFromJDBC(spark, url, tableName, user, password, driver, >>> fetchsize): >>> try: >>> df = spark.read. \ >>> format("jdbc"). \ >>> option("url", url). \ >>> option("dbtable", tableName). \ >>> option("user", user). \ >>> option("password", password). \ >>> option("driver", driver). \ >>> option("fetchsize", fetchsize). \ >>> load() >>> return df >>> except Exception as e: >>> print(f"""{e}, quitting""") >>> sys.exit(1) >>> >>> HTH >>> >>> >>> >>> view my Linkedin profile >>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/> >>> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any >>> loss, damage or destruction of data or any other property which may arise >>> from relying on this email's technical content is explicitly disclaimed. >>> The author will in no case be liable for any monetary damages arising from >>> such loss, damage or destruction. >>> >>> >>> >>> On Fri, 16 Apr 2021 at 12:21, Alun ap Rhisiart <cod...@alunaprhisiart.uk >>> <mailto:cod...@alunaprhisiart.uk>> wrote: >>> I’m just starting using PySpark (Databricks) for a education application. >>> Part of this is monitoring children’s online behaviour to alert teachers >>> whether there may be problems with bullying, extreme diets, suicide >>> ideation, and so on. I have IoT data which I need to combine with >>> information from MariaDB (this is all in Azure). I have SparkJDBC42 and >>> mariadb_java_client_2_7_2 jars installed. The connection to the database is >>> established, in that I can see it can retrieve the schema for tables. >>> >>> I have a couple of issues. The first is that I can never retrieve any id >>> columns (which are all defined as BigInt(20)), as I get a ‘long out of >>> range’ error. I’m currently working around that by not including the ids >>> themselves in the return. However, the big problem is that I get the column >>> names returned in each row instead of the values for each row, where the >>> columns are defined as strings (VARCHAR etc). Also, for columns defined as >>> TinyInt they are returned as booleans, but reversed (0 is returned as >>> True). I have tried running the SQL outside of databricks/Spark (eg in >>> DataGrip) and it returns perfectly sensible data every time. >>> >>> The code at gist:412e1f3324136a574303005a0922f610 >>> <https://gist.github.com/alunap/412e1f3324136a574303005a0922f610> >>> >>> >>> Returned: >>> +----+------+----+-----------+----+--------------+ >>> |uuid|gender| cpp|young_carer| spp|asylum_refugee| >>> +----+------+----+-----------+----+--------------+ >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> |uuid|gender|true| true|true| true| >>> +----+------+----+-----------+----+--------------+ >>> only showing top 10 rows >>> >>> On the database, device.uuid field is VARCHAR(255) and contains valid uuids >>> (no nulls). >>> children.gender is VARCHAR(255) and contains ‘M’, ‘F’, ‘MALE’, ‘FEMALE’, >>> ‘NONE’, or null. >>> children.cpp, young_carer, spp, and asylum_refugee are all tinyint(1) = 0. >>> They are nearly all 0, but the first 10 rows contain some nulls. >>> >>> I tried enclosing the query with brackets ‘(SELECT…) t’ as I gather it is a >>> subquery, and I tried adding a WHERE d.uuid = ‘an id’ with an id being one >>> where there are no nulls in the column, but no difference. So, completely >>> baffled at this point. >>> >>> Thanks for any suggestions, >>> >>> Alun ap Rhisiart >> >> Alun ap Rhisiart >> >> attached you'll find my public key >> >> You can use this key to encrypt and secure our messages. >> >> To start using it, you'll need to install an OpenPGP software on your >> computer. Below you'll find a list of possible solutions for your operating >> system: >> >> macOS >> https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail >> >> <https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail> >> Linux https://ssd.eff.org/en/module/how-use-pgp-linux >> <https://ssd.eff.org/en/module/how-use-pgp-linux> >> Windows https://ssd.eff.org/en/module/how-use-pgp-windows-pc >> <https://ssd.eff.org/en/module/how-use-pgp-windows-pc> >> iOS https://itunes.apple.com/app/ipgmail/id430780873?mt=8 >> <https://itunes.apple.com/app/ipgmail/id430780873?mt=8> >> Android >> https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain >> >> <https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain> >> >> Please import the public key into your local OpenPGP Key-Manager. >> >> >> >> > > Alun ap Rhisiart > > attached you'll find my public key > > You can use this key to encrypt and secure our messages. > > To start using it, you'll need to install an OpenPGP software on your > computer. Below you'll find a list of possible solutions for your operating > system: > > macOS > https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail > > <https://gpgtools.tenderapp.com/kb/how-to/first-steps-where-do-i-start-where-do-i-begin-setup-gpgtools-create-a-new-key-your-first-encrypted-mail> > Linux https://ssd.eff.org/en/module/how-use-pgp-linux > <https://ssd.eff.org/en/module/how-use-pgp-linux> > Windows https://ssd.eff.org/en/module/how-use-pgp-windows-pc > <https://ssd.eff.org/en/module/how-use-pgp-windows-pc> > iOS https://itunes.apple.com/app/ipgmail/id430780873?mt=8 > <https://itunes.apple.com/app/ipgmail/id430780873?mt=8> > Android > https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain > <https://play.google.com/store/apps/details?id=org.sufficientlysecure.keychain> > > Please import the public key into your local OpenPGP Key-Manager. > > > > <Alun ap Rhisiart (0EEFB724) – Public.asc>