[ https://issues.apache.org/jira/browse/AIRFLOW-6481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jarek Potiuk resolved AIRFLOW-6481.
-----------------------------------
    Fix Version/s: 2.0.0
       Resolution: Fixed

> SalesforceHook attempts to use .str accessor on object dtype
> ------------------------------------------------------------
>
>                 Key: AIRFLOW-6481
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-6481
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: hooks
>    Affects Versions: 1.10.7
>            Reporter: Teddy Hartanto
>            Assignee: Teddy Hartanto
>            Priority: Minor
>             Fix For: 2.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> I've searched through Airflow's issues and couldn't find any report of
> this. Am I the only one facing it?
> {noformat}
> Pandas version: 0.24.2{noformat}
> *Bug description*
> I'm using the SalesforceHook to fetch data from Salesforce and I encountered
> this exception:
> {code:java}
> AttributeError: ('Can only use .str accessor with string values, which use np.object_ dtype in pandas', ...)
> {code}
> The root of the problem is that some Salesforce objects have a column with a
> compound data type. E.g., a User's address is a Python dict:
> {code:java}
> <class 'dict'>: {'city': None, 'country': 'my', 'geocodeAccuracy': None,
> 'latitude': None, 'longitude': None, 'postalCode': None, 'state': None,
> 'street': None}{code}
> The problematic code is here:
> {code:java}
> if fmt == "csv":
>     # there are also a ton of newline objects
>     # that mess up our ability to write to csv
>     # we remove these newlines so that the output is a valid CSV format
>     self.log.info("Cleaning data and writing to CSV")
>     possible_strings = df.columns[df.dtypes == "object"]
>     df[possible_strings] = df[possible_strings].apply(
>         lambda x: x.str.replace("\r\n", "")
>     )
>     df[possible_strings] = df[possible_strings].apply(
>         lambda x: x.str.replace("\n", "")
>     )
>     # write the dataframe
>     df.to_csv(filename, index=False)
> {code}
> Because a Series containing Python dicts is also of dtype object, its column
> is included in "possible_strings". When .str is then called on that Series,
> the exception is thrown.
> To fix it, we could explicitly cast the object columns to string:
> {code:java}
> if fmt == "csv":
>     ...
>     df[possible_strings] = df[possible_strings].astype(str).apply(
>         lambda x: x.str.replace("\r\n", "")
>     )
>     df[possible_strings] = df[possible_strings].astype(str).apply(
>         lambda x: x.str.replace("\n", "")
>     )
> {code}
> I've tested this and it works for me. Could somebody help me verify that the
> type conversion is indeed needed? If yes, I'm keen to submit a PR to fix this,
> with a unit test included.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
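
For readers who want to try the cast proposed in the report outside Airflow, the following is a minimal standalone sketch. It is not the code that was merged for this issue. The helper name `strip_newlines` and the sample `records` frame are hypothetical. `regex=False` is added here so the patterns are treated as literal text, and `dtype=object` is an assumption made so that both columns are selected by the `df.dtypes == "object"` check regardless of how the installed pandas version infers string columns.

```python
import pandas as pd


def strip_newlines(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the cast proposed in the report."""
    df = df.copy()  # leave the caller's frame untouched
    # Same column selection as the hook: every object-dtype column.
    possible_strings = df.columns[df.dtypes == "object"]
    # Cast first so dicts and other non-string objects become strings,
    # then remove Windows and Unix line breaks.
    as_text = df[possible_strings].astype(str)
    df[possible_strings] = as_text.apply(
        lambda col: col.str.replace("\r\n", "", regex=False)
                       .str.replace("\n", "", regex=False)
    )
    return df


# Sample data (made up): one string column with line breaks and one
# column holding dicts, like the compound address field in the report.
records = pd.DataFrame(
    {
        "Name": ["Acme\r\nCorp", "Glo\nbex"],
        "Address": [
            {"city": None, "country": "my"},
            {"city": None, "country": "sg"},
        ],
    },
    dtype=object,
)

cleaned = strip_newlines(records)
```

After the call, the `Address` values in `cleaned` are the string representations of the dicts rather than dicts, which is the trade-off of casting every object column before writing the CSV.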