Thanks Riccardo.

I am well aware of the submission form.

However, my question relates to doing the submission from within PyCharm
itself.

This is what I do in the PyCharm *terminal* to invoke the Python module:

spark-submit --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar \
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 \
  analyze_house_prices_GCP.py

However, when run from the terminal it does not pick up the import
dependencies in the code:

Traceback (most recent call last):
  File
"C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py",
line 8, in <module>
    import sparkstuff as s
ModuleNotFoundError: No module named 'sparkstuff'
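
For reference, the usual remedy here is a rough sketch, not tested: bundle
the local helper modules and hand the archive to spark-submit with
--py-files. The file names are taken from the imports in the attached
script (variables.py may live under conf\ instead, per the try/except in
the script), and deps.zip is just a placeholder name:

cd C:\Users\admin\PycharmProjects\pythonProject2\DS\src
python -m zipfile -c deps.zip sparkstuff.py usedFunctions.py variables.py

spark-submit ^
  --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar ^
  --packages com.github.samelamin:spark-bigquery_2.11:0.2.6 ^
  --py-files deps.zip ^
  analyze_house_prices_GCP.py

spark-submit ships whatever is listed in --py-files to the driver and
executors and puts it on the PYTHONPATH, which is what resolves the
ModuleNotFoundError above.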

The Python code is attached; it is pretty simple.

Thanks







On Fri, 8 Jan 2021 at 15:51, Riccardo Ferrari <ferra...@gmail.com> wrote:

> You need to provide your python dependencies as well. See
> http://spark.apache.org/docs/latest/submitting-applications.html, look
> for --py-files
>
> HTH
>
> On Fri, Jan 8, 2021 at 3:13 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a module in PyCharm which reads data stored in a BigQuery table
>> and does plotting.
>>
>> At the command line in the terminal I need to add the jar file and the
>> package to make it work.
>>
>> (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit
>> --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar
>> analyze_house_prices_GCP.py
>>
>> This works, but the problem is that the imports in the module are not
>> picked up. Example:
>>
>>
>> import sparkstuff as s
>>
>>
>> This is picked up when run within PyCharm itself but not at the command
>> line!
>>
>>
>> (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit
>> --jars ..\lib\spark-bigquery-with-dependencies_2.12-0.18.0.jar
>> analyze_house_prices_GCP.py
>>
>> Traceback (most recent call last):
>>
>>   File
>> "C:/Users/admin/PycharmProjects/pythonProject2/DS/src/analyze_house_prices_GCP.py",
>> line 8, in <module>
>>
>>     import sparkstuff as s
>>
>> ModuleNotFoundError: No module named 'sparkstuff'
>>
>> The easiest option would be to run all of this within PyCharm itself,
>> supplying the jar file and package at runtime.
>>
>> Otherwise I can run it at the command line, provided the imports can be
>> resolved. I would appreciate any workaround.
>>
>> Thanks
>>
>>
>> LinkedIn:
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>>
>>
>>
>>
>>
>
from __future__ import print_function
import sys
import os
import logger
import findspark
findspark.init()
from pyspark.sql import functions as F
import sparkstuff as s
import usedFunctions as uf
try:
  import variables as v
except ModuleNotFoundError:
  from conf import variables as v
from pyspark.sql import SparkSession
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext
from google.cloud import bigquery
from google.oauth2 import service_account
appName = "DS"
spark = s.spark_session(appName)
sc = s.sparkcontext()

lst = spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect()
print("\nStarted at")
uf.println(lst)

tmp_bucket = "tmp_storage_bucket/tmp"

# Set the temporary storage location
spark.conf.set("temporaryGcsBucket",tmp_bucket)
spark.sparkContext.setLogLevel("ERROR")

HadoopConf = sc._jsc.hadoopConfiguration()
HadoopConf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
HadoopConf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")

#bq = spark._sc._jvm.com.samelamin.spark.bigquery.BigQuerySQLContext(spark._wrapped._jsqlContext)

# needed filters

start_date = "2010-01-01"
end_date = "2020-01-01"

spark.conf.set("GcpJsonKeyFile",v.jsonKeyFile)
spark.conf.set("BigQueryProjectId",v.projectId)
spark.conf.set("BigQueryDatasetLocation",v.datasetLocation)
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("fs.gs.project.id", v.projectId)
spark.conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
spark.conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
spark.conf.set("temporaryGcsBucket", v.tmp_bucket)


sqltext = ""
from pyspark.sql.window import Window

# read data from the BigQuery table in the staging area
print("\nReading data from " + v.projectId + ":" + v.sourceDataset + "." + v.sourceTable)
source_df = spark.read. \
              format("bigquery"). \
              option("dataset", v.sourceDataset). \
              option("table", v.sourceTable). \
              load()

source_df.printSchema()
# Get the data for Kensington and Chelsea over the ten-year window and save it in the ds dataset
# (the region column is assumed to be literally named "regionname")
summary_df = source_df.filter((F.col("Date").between(start_date, end_date)) & (F.col("regionname") == 'Kensington and Chelsea'))
rows = summary_df.count()
print("Total number of rows for Kensington and Chelsea is ", rows)


# Save the result set to a BigQuery table; the table is created if it does not exist
print("\nSaving data to " + v.fullyQualifiedoutputTableId)
summary_df. \
    write. \
    format("bigquery"). \
    mode("overwrite"). \
    option("table", v.fullyQualifiedoutputTableId). \
    save()


lst = spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect()
print("\nFinished at")
uf.println(lst)
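
On the "run it all inside PyCharm" option: the attached script gets its
session from the local sparkstuff helper. Below is a minimal sketch of what
such a helper could look like so that the BigQuery connector is pulled in
at session-creation time and no spark-submit flags are needed. This is not
the author's actual module; the function names mirror the imports in the
script above, and the Maven coordinates are an assumption derived from the
jar name on the command line.

# sparkstuff.py -- a minimal sketch only, not the author's actual module.
# Declaring spark.jars.packages on the builder lets the script run directly
# inside PyCharm (plain Run), because Spark resolves and downloads the
# connector itself instead of relying on spark-submit --jars/--packages.
from pyspark.sql import SparkSession

# Assumed Maven coordinates, derived from the jar name used on the command
# line above; further packages (e.g. the samelamin one passed via
# --packages) can be appended comma-separated if they are really needed.
BQ_PACKAGE = "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0"


def spark_session(appName):
    # spark.jars.packages must be set before the first SparkSession is created
    return (SparkSession.builder
            .appName(appName)
            .config("spark.jars.packages", BQ_PACKAGE)
            .getOrCreate())


def sparkcontext():
    # Reuse the SparkContext of the already-created session
    return SparkSession.builder.getOrCreate().sparkContext

With the connector declared via spark.jars.packages on the builder, the
script can be launched with PyCharm's ordinary Run configuration, and the
local imports (sparkstuff, usedFunctions, variables) resolve through the
project's own PYTHONPATH rather than through --py-files.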
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
