Here's my ideal use case:

1. Create multiple datasets, each mapped to a different file directory.
2. Create further datasets by logically adding computed columns defined by
   expressions.
3. Create a joined dataset by joining those datasets.
4. Finally, run a Scanner on the joined dataset to start materialization.

pyarrow.dataset.Dataset.filter accepts a @filter expression, but it has no
@columns argument.
pyarrow.dataset.Scanner supports both @filter and @columns, but I don't want to
create interim copies of the data in memory.

Simplified example:
Given a table that captures locale values like 'en-US', 'en-GB', 'fr-CA', etc.,
I want to use a pyarrow logical expression to split this into language and
country, so I end up with:
Language: 'en', 'en', 'fr', ..
Country: 'US', 'GB', 'CA', ..
I then want to join Country to a Country dataset, which contains Country and
Country_Name, to get:
Language: 'en', 'en', 'fr', ..
Country: 'US', 'GB', 'CA', ..
Country_Name: 'USA', 'Great Britain', 'Canada', ..

Basically, can a dataset handle "logical" column projection to avoid physical
materialization in memory?

