Here's my ideal use case scenario:

1. Create multiple datasets mapped to different file directories.
2. Create more datasets by logically generating additional computed columns using expressions.
3. Create a joined dataset by joining datasets.
4. Finally, run a Scanner on the joined dataset to start materialization.

pyarrow.dataset.Dataset.filter supports adding a @filter, but it doesn't have a @columns argument. pyarrow.dataset.Scanner supports both @filter and @columns, but I don't want to create interim copies of data in memory.

Simplified example: given a table that captures locale values like 'en-US', 'en-GB', 'fr-CA', etc., I want to use a pyarrow logical expression to split this into language and country so I end up with:

    Language: 'en', 'en', 'fr', ..
    Country: 'US', 'GB', 'CA', ..

I then want to join Country to a Country dataset, which contains Country and Country_Name:

    Language: 'en', 'en', 'fr', ..
    Country: 'US', 'GB', 'CA', ..
    Country_Name: 'USA', 'Great Britain', 'Canada', ..

Basically, can a dataset handle "logical" column projection to avoid physical materialization in memory?
