Thanks Jerry for your comments.

The easiest option, and I concur, is to lump all these fixture files
currently under the fixtures package together in conftest.py under the
*tests* package.

Then you can do away with the fixtures package altogether and it works.
However, I gather plug-and-play becomes less manageable when you have a
large number of fixtures (large being relative here). My main modules (not
tests) are designed to do ETL from any database that supports JDBC
connections (bar Google BigQuery, which only works correctly with the Spark
API). You specify your source DB and target DB in a yml file for any
pluggable JDBC database.
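As an aside, reading such a source/target spec is straightforward. This is a hypothetical sketch (the key names and URLs below are made up for illustration, not taken from my actual yml file), assuming PyYAML is available:

```python
# Hypothetical sketch of a source/target DB spec held in yml.
# Key names and JDBC URLs here are invented for illustration; PyYAML assumed.
import yaml

SAMPLE_YML = """
source:
  driver: com.mysql.cj.jdbc.Driver
  url: jdbc:mysql://localhost:3306/sourcedb
target:
  driver: org.postgresql.Driver
  url: jdbc:postgresql://localhost:5432/targetdb
"""

def load_db_spec(text):
    # parse the yml document into a plain nested dict
    return yaml.safe_load(text)

spec = load_db_spec(SAMPLE_YML)
print(spec["source"]["url"])  # -> jdbc:mysql://localhost:3306/sourcedb
```

Swapping databases then only means editing the yml, not the code.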

Going back to Pytest, please check the reference below for the rationale
behind packaging fixtures:

How to modularize your py.test fixtures (github.com)
<https://gist.github.com/peterhurford/09f7dcda0ab04b95c026c60fa49c2a68>

With regard to your other point on fixtures (a fixture in each file), I
have this fixture *loadIntoMysqlTable()*, which uses the data frame
created in *extractHiveData*: it reads sample records from Hive and
populates the MySql test table. The input it needs is the DataFrame
constructed in the fixture module extractHiveData, which is passed to it
as a parameter. This is the only way it seems to work throughout my tests:


# col comes from pyspark; s, config and ctest are this project's own
# helper and config modules, imported elsewhere in the test package
import sys
import pytest
from pyspark.sql.functions import col

@pytest.fixture(scope = "session")
def extractHiveData():
    # read data through JDBC from Hive
    spark_session = s.spark_session(ctest['common']['appName'])
    tableName = config['GCPVariables']['sourceTable']
    fullyQualifiedTableName = config['hiveVariables']['DSDB'] + '.' + tableName
    house_df = s.loadTableFromHiveJDBC(spark_session, fullyQualifiedTableName)
    # sample data: select n rows from Kensington and Chelsea and
    # n rows from City of Westminster
    num_rows = int(ctest['statics']['read_df_rows'] / 2)
    house_df = house_df.filter(col("regionname") == "Kensington and Chelsea").limit(num_rows). \
        unionAll(house_df.filter(col("regionname") == "City of Westminster").limit(num_rows))
    return house_df

@pytest.fixture(scope = "session")
def loadIntoMysqlTable(extractHiveData):
    # extractHiveData here is the DataFrame returned by the fixture
    # above, injected by pytest as a named parameter
    try:
        extractHiveData. \
            write. \
            format("jdbc"). \
            option("url", test_url). \
            option("dbtable", ctest['statics']['sourceTable']). \
            option("user", ctest['statics']['user']). \
            option("password", ctest['statics']['password']). \
            option("driver", ctest['statics']['driver']). \
            mode(ctest['statics']['mode']). \
            save()
        return True
    except Exception as e:
        print(f"{e}, quitting")
        sys.exit(1)
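For what it's worth, the same dependency injection can be seen in isolation with plain Python objects standing in for the data frames. This is a minimal sketch, with names invented for illustration (they are not from my actual modules):

```python
import pytest

def make_rows():
    # hypothetical stand-in for the Hive extract step
    return [("Kensington and Chelsea", 1500000), ("City of Westminster", 1400000)]

def rows_to_dict(rows):
    # hypothetical stand-in for the MySql load step
    return {region: price for region, price in rows}

@pytest.fixture(scope="session")
def extract_data():
    return make_rows()

@pytest.fixture(scope="session")
def load_data(extract_data):
    # extract_data is the LIST returned by the fixture above, injected
    # by pytest -- it is never called as a function here
    return rows_to_dict(extract_data)

def test_load(load_data):
    assert load_data["City of Westminster"] == 1400000
```

Running pytest on this file shows the chain resolving: `extract_data` runs once per session, its return value flows into `load_data`, and the test receives the final dict.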

Thanks again.


Mich


LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Tue, 9 Feb 2021 at 15:47, Jerry Vinokurov <grapesmo...@gmail.com> wrote:

> Hi Mich,
>
> I'm a bit confused by what you mean when you say that you cannot call a
> fixture in another fixture. The fixtures resolve dependencies among
> themselves by means of their named parameters. So that means that if I have
> a fixture
>
> @pytest.fixture
> def fixture1():
>     return SomeObj()
>
> and another fixture
>
> @pytest.fixture
> def fixture2(fixture1):
>     return do_something_with_obj(fixture1)
>
> my second fixture will simply receive the object created by the first. As
> such, you do not need to "call" the second fixture at all. Of course, if
> you had some use case where you were constructing an object in the second
> fixture, you could have the first return a class, or you could have it
> return a function. In fact, I have fixtures in a project that do both. Here
> they are:
>
> @pytest.fixture
> def func():
>
>     def foo(x, y, z):
>
>         return (x + y) * z
>
>     return foo
>
> That's a fixture that returns a function, and any test using the func
> fixture would receive that actual function as a value, which could then be
> invoked by calling e.g. func(1, 2, 3). Here's another fixture that's more
> like what you're doing:
>
>
> @pytest.fixture
> def data_frame():
>     return pd.DataFrame.from_records([(1, 2, 3), (4, 5, 6)],
>                                      columns=['x', 'y', 'z'])
>
> This one just returns a data frame that can be operated on.
>
> Looking at your setup, I don't want to say that it's wrong per se (it
> could be very appropriate to your specific project to split things up among
> these many files) but I would say that it's not idiomatic usage of pytest
> fixtures, in my experience. It feels to me like you're jumping through a
> lot of hoops to set up something that could be done quite easily and
> compactly in conftest.py. I do want to emphasize that there is no
> limitation on how fixtures can be used within functions or within other
> fixtures (which are also just functions), since the result of the fixture
> call is just some Python object.
>
> Hope this helps,
> Jerry
>
> On Tue, Feb 9, 2021 at 10:18 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> I was a bit confused by the use of fixtures in Pytest, with dataframes
>> passed as an input pipeline from one fixture to another. I wrote this
>> after spending some time on it. As usual, it is heuristic rather than
>> strictly by the book, so to speak.
>>
>> In PySpark and PyCharm you can ETL from Hive to BigQuery or from Oracle
>> to Hive etc. However, for PyTest I decided to use MySql as the database
>> of choice for testing, with a small sample of data (200 rows). I mentioned
>> fixtures. Simply put, "Fixtures are functions, which will run before
>> each test function to which they are applied, to prepare data. Fixtures
>> are used to feed some data to the tests, such as database connections". If
>> you have an ordering like read data (Extract), do something with it
>> (Transform) and save it somewhere (Load), using Spark, then these all
>> happen in memory with data frames feeding each other.
>>
>> The crucial thing to remember is that a fixture receives another
>> fixture's return value by naming it as a parameter, not by invoking it
>> directly!
>>
>> Example:
>>
>> ## This is correct
>> @pytest.fixture(scope = "session")
>> def transformData(readSourceData):  ## fixture passed as parameter
>>     # this is incorrect (cannot call a fixture in another fixture):
>>     # read_df = readSourceData()
>>     # so this operation becomes:
>>     transformation_df = readSourceData. \
>>         select( \
>>         ....
>>
>> Say in PyCharm, under the tests package, you create a package "fixtures"
>> (just a name, nothing to do with pytest "fixtures"), and in there you put
>> your ETL Python modules that prepare data for you. Example:
>>
>> ### file --> saveData.py
>> @pytest.fixture(scope = "session")
>> def saveData(transformData):
>>     # Write to test target table
>>     try:
>>         transformData. \
>>             write. \
>>             format("jdbc"). \
>>             ....
>>
>>
>> You then drive this test by creating a file called *conftest.py* under
>> the *tests* package. You can then make your fixtures available by
>> importing them in this file, as below:
>>
>> import pytest
>> from tests.fixtures.extractHiveData import extractHiveData
>> from tests.fixtures.loadIntoMysqlTable import loadIntoMysqlTable
>> from tests.fixtures.readSavedData import readSavedData
>> from tests.fixtures.readSourceData import readSourceData
>> from tests.fixtures.transformData import transformData
>> from tests.fixtures.saveData import saveData
>>
>> Then you have your test Python file, say *test_oracle.py*, under the
>> tests package, and put your assertions there:
>>
>> import pytest
>> from src.config import ctest
>>
>> @pytest.mark.usefixtures("extractHiveData")
>> def test_extract(extractHiveData):
>>     assert extractHiveData.count() > 0
>>
>> @pytest.mark.usefixtures("loadIntoMysqlTable")
>> def test_loadIntoMysqlTable(loadIntoMysqlTable):
>>     assert loadIntoMysqlTable
>>
>> @pytest.mark.usefixtures("readSourceData")
>> def test_readSourceData(readSourceData):
>>     assert readSourceData.count() == ctest['statics']['read_df_rows']
>>
>> @pytest.mark.usefixtures("transformData")
>> def test_transformData(transformData):
>>     assert transformData.count() == ctest['statics']['transformation_df_rows']
>>
>> @pytest.mark.usefixtures("saveData")
>> def test_saveData(saveData):
>>     assert saveData
>>
>> @pytest.mark.usefixtures("readSavedData")
>> def test_readSavedData(transformData, readSavedData):
>>     assert readSavedData.subtract(transformData).count() == 0
>>
>> This is an illustration from PyCharm of the directory structure under
>> the tests package:
>>
>>
>> [image: image.png]
>>
>>
>> Let me know your thoughts.
>>
>>
>> Cheers,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>
> --
> http://www.google.com/profiles/grapesmoker
>
