Re: Column names returned as rows and other jdbc weirdness

2021-04-16 Thread Mich Talebzadeh
One more thing that I find useful is a tool like the SQuirreL SQL Client. It
is basic but very powerful for testing JDBC connections.

SQuirreL SQL Client Home Page (sourceforge.net)


Just install it on a node in the cloud and try testing your connections. It
has a SQL query tool as well.
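For a MariaDB connection, the details would be along these lines (host and
database are placeholders; check the driver's documentation for the exact SSL
parameters):

  Driver class: org.mariadb.jdbc.Driver
  JDBC URL:     jdbc:mariadb://<host>:3306/<database>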

HTH




   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 16 Apr 2021 at 18:28, Alun ap Rhisiart 
wrote:

> Thanks, Mich,
>
> Yes, I’m coming to the conclusion that there is a driver issue. Our ops
> guy is back next Monday, so I’ll see what he can do. I tried things like
> adding a WHERE clause so that I was sure that the row received had no
> nulls, in case that was the problem, but it was exactly the same.
>
> Regards,
>
> Alun
>
>
> On 16 Apr 2021, at 17:05, Mich Talebzadeh 
> wrote:
>
> thanks for the info Alun.
>
> From my experience these issues of messed up columns through JDBC are
> usually related to the JDBC driver being used.
>
> Your database is MariaDB. I have no direct experience of this database but
> it is akin to MySQL. Case in point, I had all sorts of issues connecting to
> Hive over JDBC from cloud to on-premise; after trying four different
> drivers, only hive_driver: com.cloudera.hive.jdbc41.HS2Driver worked.
>
> Can you try a few other drivers that support SSL for MariaDB from
> different vendors?
>
> Powerful MySQL JDBC Driver Download | Progress DataDirect
> 
>
> I guess the MySQL driver may work. Just try these drivers.
>
> You can either add the jar file to $SPARK_HOME/jars or reference it in
> $SPARK_HOME/conf/spark-defaults.conf:
>
> spark.driver.extraClassPath   /data6/hduser/mariaDB_specific.jar
>
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 16 Apr 2021 at 16:34, Alun ap Rhisiart 
> wrote:
>
>> Many thanks for the reply, Mich. I’m running Databricks on Azure.
>> Databricks runtime version: 8.1 ML (includes Apache Spark 3.1.1, Scala
>> 2.12)
>>
>> The UUID column, I believe, comes from Azure IoT. It is generally 36
>> characters, like '934c1f58-ed11-4e48-b157-aab869d9b325’, although I note
>> some are shorter, possibly test data. The column is defined as VARCHAR(255).
>>
>> If I run the SQL outside of Spark I get exactly what I expect, the uuids
>> in the first column, gender as ‘M’, ‘F’ or whatever in the second, and
>> zeroes and nulls in the other columns. That is the puzzling part: that
>> every variation of SQL I have tried works perfectly well in itself (in
>> DataGrip eg), but returns junk when run in Spark. For the last four
>> columns, and possibly the second, I could understand the nulls confusing
>> it, but the bit I really don’t understand is why it says ‘uuid’
>> and ‘gender’, ie the column names, for all the rows of the first two
>> columns.
>>
>> As for just downloading the tables and doing the join in spark data
>> frames, I tried that but I hit the other issue I mention:
>>
>> SQLException: Out of range value for column 'id' : value id is not in
>> Long range
>>
>> Even though ‘SELECT max(id) from devices’ returns 16091, and there are no
>> nulls. Which is why I was doing the join in the DB. The IDs, as I
>> mentioned, are all BigInt(20).
>>
>>
>> On 16 Apr 2021, at 15:53, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>> Which version of Spark are you using?
>>
>> UUID is generated by DB through OS low level call and it is 36 characters
>>
>> UUID=$(uuidgen)
>> echo $UUID
>> ef080790-4c3f-4a5f-8db7-1024338d34f2
>>
>>
>> in other words string will do it or VARCHAR(36)
>>
>> When you run that SQL directly on the database itself what do you get?
>>
>> The alternative is to make two calls directly via JDBC to the underlying
>> database, get the data back into DF from those two tables and do the join
>> in Pyspark itself as a test
>>
>> Spark's connection to any DB which supports JDBC is generic:
>>
>> def loadTableFromJDBC(spark, url, tableName, user, password, driver, fetchsize):
>>     try:
>>         df = spark.read. \
>>             format("jdbc"). \
>>             option("url", url). \
>>             option("dbtable", 


Re: Column names returned as rows and other jdbc weirdness

2021-04-16 Thread Mich Talebzadeh
Thanks for the info, Alun.

From my experience, these issues of messed-up columns through JDBC are
usually related to the JDBC driver being used.

Your database is MariaDB. I have no direct experience of this database but
it is akin to MySQL. Case in point, I had all sorts of issues connecting to
Hive over JDBC from cloud to on-premise; after trying four different
drivers, only hive_driver: com.cloudera.hive.jdbc41.HS2Driver worked.

Can you try a few other drivers that support SSL for MariaDB from different
vendors?

Powerful MySQL JDBC Driver Download | Progress DataDirect


I guess the MySQL driver may work. Just try these drivers.

You can either add the jar file to $SPARK_HOME/jars or reference it in
$SPARK_HOME/conf/spark-defaults.conf:

spark.driver.extraClassPath   /data6/hduser/mariaDB_specific.jar
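
For instance, with PySpark the driver class can also be selected explicitly on
the read itself (URL, credentials and table name below are placeholders; swap
in whichever vendor's driver class you end up testing):

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:mariadb://<host>:3306/<database>")  # placeholder connection details
      .option("driver", "org.mariadb.jdbc.Driver")             # or e.g. com.mysql.cj.jdbc.Driver
      .option("dbtable", "devices")
      .option("user", "<user>")
      .option("password", "<password>")
      .load())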


HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 16 Apr 2021 at 16:34, Alun ap Rhisiart 
wrote:

> Many thanks for the reply, Mich. I’m running Databricks on Azure.
> Databricks runtime version: 8.1 ML (includes Apache Spark 3.1.1, Scala
> 2.12)
>
> The UUID column, I believe, comes from Azure IoT. It is generally 36
> characters, like '934c1f58-ed11-4e48-b157-aab869d9b325’, although I note
> some are shorter, possibly test data. The column is defined as VARCHAR(255).
>
> If I run the SQL outside of Spark I get exactly what I expect, the uuids
> in the first column, gender as ‘M’, ‘F’ or whatever in the second, and
> zeroes and nulls in the other columns. That is the puzzling part: that
> every variation of SQL I have tried works perfectly well in itself (in
> DataGrip eg), but returns junk when run in Spark. For the last four
> columns, and possibly the second, I could understand the nulls confusing
> it, but the bit I really don’t understand is why it says ‘uuid’
> and ‘gender’, ie the column names, for all the rows of the first two
> columns.
>
> As for just downloading the tables and doing the join in spark data
> frames, I tried that but I hit the other issue I mention:
>
> SQLException: Out of range value for column 'id' : value id is not in Long
> range
>
> Even though ‘SELECT max(id) from devices’ returns 16091, and there are no
> nulls. Which is why I was doing the join in the DB. The IDs, as I
> mentioned, are all BigInt(20).
>
>
> On 16 Apr 2021, at 15:53, Mich Talebzadeh 
> wrote:
>
> Hi,
>
> Which version of Spark are you using?
>
> UUID is generated by DB through OS low level call and it is 36 characters
>
> UUID=$(uuidgen)
> echo $UUID
> ef080790-4c3f-4a5f-8db7-1024338d34f2
>
>
> in other words string will do it or VARCHAR(36)
>
> When you run that SQL directly on the database itself what do you get?
>
> The alternative is to make two calls directly via JDBC to the underlying
> database, get the data back into DF from those two tables and do the join
> in Pyspark itself as a test
>
> Spark's connection to any DB which supports JDBC is generic:
>
> def loadTableFromJDBC(spark, url, tableName, user, password, driver, fetchsize):
>     try:
>         df = spark.read. \
>             format("jdbc"). \
>             option("url", url). \
>             option("dbtable", tableName). \
>             option("user", user). \
>             option("password", password). \
>             option("driver", driver). \
>             option("fetchsize", fetchsize). \
>             load()
>         return df
>     except Exception as e:
>         print(f"""{e}, quitting""")
>         sys.exit(1)
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 16 Apr 2021 at 12:21, Alun ap Rhisiart 
> wrote:
>
>> I’m just starting to use PySpark (Databricks) for an education application.
>> Part of this is monitoring children’s online behaviour to alert teachers
>> whether there may be problems with bullying, extreme diets, suicide
>> ideation, and so on. I have IoT data which I need to combine with
>> information from MariaDB (this is all in Azure). I have SparkJDBC42 and
>> mariadb_java_client_2_7_2 jars installed. The connection to the database is
>> established, in that I can see it can retrieve the schema for tables.
>>
>> I have a couple of issues. The first is that I can never retrieve any id
>> 


Re: Column names returned as rows and other jdbc weirdness

2021-04-16 Thread Mich Talebzadeh
Hi,

Which version of Spark are you using?

A UUID is generated by the DB through an OS low-level call, and it is 36 characters:

UUID=$(uuidgen)
echo $UUID
ef080790-4c3f-4a5f-8db7-1024338d34f2


In other words, a string will do, e.g. VARCHAR(36).

When you run that SQL directly on the database itself what do you get?

The alternative is to make two calls directly via JDBC to the underlying
database, get the data from those two tables back into DataFrames, and do the
join in PySpark itself as a test.

Spark's connection to any DB which supports JDBC is generic:

def loadTableFromJDBC(spark, url, tableName, user, password, driver, fetchsize):
    try:
        df = spark.read. \
            format("jdbc"). \
            option("url", url). \
            option("dbtable", tableName). \
            option("user", user). \
            option("password", password). \
            option("driver", driver). \
            option("fetchsize", fetchsize). \
            load()
        return df
    except Exception as e:
        print(f"""{e}, quitting""")
        sys.exit(1)
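
A call for the two tables mentioned in this thread might then look roughly like
this (URL, credentials, fetch size and the join key are placeholders, not part
of the original code):

url = "jdbc:mariadb://<host>:3306/<database>"   # placeholder connection details
driver = "org.mariadb.jdbc.Driver"
devices_df  = loadTableFromJDBC(spark, url, "devices",  "<user>", "<password>", driver, 1000)
children_df = loadTableFromJDBC(spark, url, "children", "<user>", "<password>", driver, 1000)
# the join key below is illustrative; use the actual column linking the two tables
joined_df = devices_df.join(children_df, devices_df["child_id"] == children_df["id"])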

HTH



   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Fri, 16 Apr 2021 at 12:21, Alun ap Rhisiart 
wrote:

> I’m just starting to use PySpark (Databricks) for an education application.
> Part of this is monitoring children’s online behaviour to alert teachers
> whether there may be problems with bullying, extreme diets, suicide
> ideation, and so on. I have IoT data which I need to combine with
> information from MariaDB (this is all in Azure). I have SparkJDBC42 and
> mariadb_java_client_2_7_2 jars installed. The connection to the database is
> established, in that I can see it can retrieve the schema for tables.
>
> I have a couple of issues. The first is that I can never retrieve any id
> columns (which are all defined as BigInt(20)), as I get a ‘long out of
> range’ error. I’m currently working around that by not including the ids
> themselves in the return. However, the big problem is that I get the column
> names returned in each row instead of the values for each row, where the
> columns are defined as strings (VARCHAR etc). Also, for columns defined as
> TinyInt they are returned as booleans, but reversed (0 is returned as
> True). I have tried running  the SQL outside of databricks/Spark (eg in
> DataGrip) and it returns perfectly sensible data every time.
>
> The code at gist:412e1f3324136a574303005a0922f610
> 
>
>
> Returned:
> +----+------+----+-----------+----+--------------+
> |uuid|gender| cpp|young_carer| spp|asylum_refugee|
> +----+------+----+-----------+----+--------------+
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> |uuid|gender|true|       true|true|          true|
> +----+------+----+-----------+----+--------------+
> only showing top 10 rows
>
> On the database, device.uuid field is VARCHAR(255) and contains valid
> uuids (no nulls).
> children.gender is VARCHAR(255) and contains ‘M’, ‘F’, ‘MALE’, ‘FEMALE’,
> ‘NONE’, or null.
> children.cpp, young_carer, spp, and asylum_refugee are all tinyint(1) = 0.
> They are nearly all 0, but the first 10 rows contain some nulls.
>
> I tried enclosing the query with brackets ‘(SELECT…) t’ as I gather it is
> a subquery, and I tried adding a WHERE d.uuid = ‘an id’ with an id being
> one where there are no nulls in the column, but no difference. So,
> completely baffled at this point.
>
> Thanks for any suggestions,
>
> Alun ap Rhisiart
>


Re: Loading Hadoop-Azure in Kubernetes

2021-04-16 Thread Pol Santamaria
Hi Nick,

You should check which Spark version "latest" is, understand which Hadoop
version "spark:latest" was built on top of, and then check the compatibility
of Hadoop with the Azure libraries. In the past, I used the following
Dockerfile to experiment:

FROM gcr.io/spark-operator/spark:v3.0.0
USER root
ADD https://repo1.maven.org/maven2/com/microsoft/azure/azure-storage/2.0.0/azure-storage-2.0.0.jar /opt/spark/jars
ADD https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/2.7.7/hadoop-azure-2.7.7.jar /opt/spark/jars
ADD https://repo1.maven.org/maven2/com/azure/azure-storage-blob/12.8.0/azure-storage-blob-12.8.0.jar /opt/spark/jars
ADD https://repo1.maven.org/maven2/com/azure/azure-storage-common/12.8.0/azure-storage-common-12.8.0.jar /opt/spark/jars


And the following properties:
spark.hadoop.fs.wasb.impl org.apache.hadoop.fs.azure.NativeAzureFileSystem
spark.hadoop.fs.AbstractFileSystem.wasb.impl org.apache.hadoop.fs.azure.Wasb
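
With those jars and properties in place, a minimal PySpark read over wasb://
would look roughly like this (the storage account, container, key and path are
placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # spark.hadoop.* settings are copied into the Hadoop configuration
         .config("spark.hadoop.fs.wasb.impl",
                 "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
         .config("spark.hadoop.fs.AbstractFileSystem.wasb.impl",
                 "org.apache.hadoop.fs.azure.Wasb")
         # placeholder account name and key
         .config("spark.hadoop.fs.azure.account.key.<account>.blob.core.windows.net",
                 "<storage-account-key>")
         .getOrCreate())

df = spark.read.csv("wasb://<container>@<account>.blob.core.windows.net/path/to/data")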


Good luck,

Pol Santamaria

On Fri, Apr 16, 2021 at 3:40 PM Nick Stenroos-Dam  wrote:

> Hello
>
>
>
> I am trying to load the Hadoop-Azure driver in Apache Spark, but so far I
> have failed.
>
> The plan is to include the required files in the docker image, as we plan
> on using a Client-mode SparkSession.
>
>
>
> My current Dockerfile looks like this:
> --
>
> FROM spark:latest
>
>
>
> COPY *.jar $SPARK_HOME/jars
>
>
>
> ENV
> SPARK_EXTRA_CLASSPATH="$SPARK_HOME/jars/hadoop-azure-3.2.0.jar:$SPARK_HOME/jars/azure-keyvault-core-1.2.4.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/jetty-util-ajax-9.3.24.v20180605.jar:$SPARK_HOME/jars/wildfly-openssl-2.1.3.Final.jar"
>
> ENV HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"
> --
>
>
>
> In the directory I have the following dependencies:
>
> hadoop-azure-3.2.0.jar
>
> azure-storage-8.6.6.jar
>
> azure-keyvault-core-1.2.4.jar
>
> jetty-util-ajax-9.3.24.v20180605.jar
>
> wildfly-openssl-2.1.3.Final.jar
>
>
>
> (I have validated that these files are part of the image and located where
> I expect (/opt/spark/jars))
>
>
>
> When looking in the Spark UI – Environment, I can’t see any indication that
> hadoop-azure has been loaded.
>
> In addition, when I try to read a file using the wasb:// scheme, I get the
> following error:
>
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.azure.NativeAzureFileSystem not found
>


Loading Hadoop-Azure in Kubernetes

2021-04-16 Thread Nick Stenroos-Dam
Hello

I am trying to load the Hadoop-Azure driver in Apache Spark, but so far I have 
failed.
The plan is to include the required files in the docker image, as we plan on 
using a Client-mode SparkSession.

My current Dockerfile looks like this:

FROM spark:latest

COPY *.jar $SPARK_HOME/jars

ENV 
SPARK_EXTRA_CLASSPATH="$SPARK_HOME/jars/hadoop-azure-3.2.0.jar:$SPARK_HOME/jars/azure-keyvault-core-1.2.4.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/azure-storage-8.6.6.jar:$SPARK_HOME/jars/jetty-util-ajax-9.3.24.v20180605.jar:$SPARK_HOME/jars/wildfly-openssl-2.1.3.Final.jar"
ENV HADOOP_OPTIONAL_TOOLS="hadoop-azure,hadoop-azure-datalake"


In the directory I have the following dependencies:
hadoop-azure-3.2.0.jar
azure-storage-8.6.6.jar
azure-keyvault-core-1.2.4.jar
jetty-util-ajax-9.3.24.v20180605.jar
wildfly-openssl-2.1.3.Final.jar

(I have validated that these files are part of the image and located where I 
expect (/opt/spark/jars))

When looking in the Spark UI - Environment, I can't see any indication that
hadoop-azure has been loaded.
In addition, when I try to read a file using the wasb:// scheme, I get the
following error:
java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.azure.NativeAzureFileSystem not found


Column names returned as rows and other jdbc weirdness

2021-04-16 Thread Alun ap Rhisiart
I’m just starting to use PySpark (Databricks) for an education application. Part 
of this is monitoring children’s online behaviour to alert teachers whether 
there may be problems with bullying, extreme diets, suicide ideation, and so 
on. I have IoT data which I need to combine with information from MariaDB (this 
is all in Azure). I have SparkJDBC42 and mariadb_java_client_2_7_2 jars 
installed. The connection to the database is established, in that I can see it 
can retrieve the schema for tables.

I have a couple of issues. The first is that I can never retrieve any id 
columns (which are all defined as BigInt(20)), as I get a ‘long out of range’ 
error. I’m currently working around that by not including the ids themselves in 
the return. However, the big problem is that I get the column names returned in 
each row instead of the values for each row, where the columns are defined as 
strings (VARCHAR etc). Also, for columns defined as TinyInt they are returned 
as booleans, but reversed (0 is returned as True). I have tried running the 
SQL outside of Databricks/Spark (e.g. in DataGrip) and it returns perfectly 
sensible data every time.

The code at gist:412e1f3324136a574303005a0922f610 



Returned:
+----+------+----+-----------+----+--------------+
|uuid|gender| cpp|young_carer| spp|asylum_refugee|
+----+------+----+-----------+----+--------------+
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
|uuid|gender|true|       true|true|          true|
+----+------+----+-----------+----+--------------+
only showing top 10 rows

On the database, device.uuid field is VARCHAR(255) and contains valid uuids (no 
nulls). 
children.gender is VARCHAR(255) and contains ‘M’, ‘F’, ‘MALE’, ‘FEMALE’, 
‘NONE’, or null.
children.cpp, young_carer, spp, and asylum_refugee are all tinyint(1) = 0. They 
are nearly all 0, but the first 10 rows contain some nulls.

I tried enclosing the query in brackets, '(SELECT…) t', as I gather it is a 
subquery, and I tried adding a WHERE d.uuid = 'an id', with the id being one 
where there are no nulls in the columns, but it made no difference. So I'm 
completely baffled at this point.
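
For reference, a minimal sketch of the subquery-as-dbtable pattern described
above (the connection details, credentials and join key are placeholders, not
the exact query I am running):

query = ("(SELECT d.uuid, c.gender, c.cpp, c.young_carer, c.spp, c.asylum_refugee "
         "FROM devices d JOIN children c ON c.id = d.child_id) t")  # join key is illustrative

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:mariadb://<host>:3306/<database>")  # placeholder connection details
      .option("driver", "org.mariadb.jdbc.Driver")
      .option("dbtable", query)
      .option("user", "<user>")
      .option("password", "<password>")
      .load())
df.show(10)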

Thanks for any suggestions,

Alun ap Rhisiart