Hello,

Please forgive what I'm sure is a frequent question, but I have not been able 
to find a reasonable solution to what must be a very standard issue.  I expect 
to hit a very common pattern: a pipeline element retrieves a large data set, 
performs an expensive computation to create one or more new columns, and needs 
to save the expanded data set for downstream pipeline elements, which will 
consume some of the new data and some of the old.
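For concreteness, the step I have in mind looks roughly like this (a minimal 
sketch, assuming pyarrow and Parquet; the file and column names are made up):

    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Read the persisted data set.
    table = pq.read_table("large_dataset.parquet")

    # Expensive computation producing a new column (a trivial stand-in here).
    derived = pc.multiply(table["measurement"], 2)

    # Write the expanded data set as a full new copy -- this is the
    # duplication I would like to avoid.
    expanded = table.append_column("derived", derived)
    pq.write_table(expanded, "large_dataset_expanded.parquet")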

As I understand it, there is no way to alter a persisted data set.  Is that 
correct?  If so, how do others address this situation?  The obvious answer is 
to write a new data set, but that approach wastes space and encourages data 
duplication.  One could write only the new columns to a new data set, but then 
we have to manage the links between data sets.  Does Arrow manage that?  If 
not, are there standard extensions for managing the links, or is there a 
better way?  A sketch of what I mean follows.
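To make that second option concrete, here is a minimal sketch of writing the 
new columns to their own file and relying on row order as the implicit link 
(again assuming pyarrow and Parquet; all file and column names are invented):

    import pyarrow as pa
    import pyarrow.compute as pc
    import pyarrow.parquet as pq

    # Compute the derived column from the original data set.
    base = pq.read_table("large_dataset.parquet")
    derived = pc.multiply(base["measurement"], 2)  # stand-in computation

    # Persist only the new column; row order is the implicit link
    # back to the original file.
    pq.write_table(pa.table({"derived": derived}), "derived_columns.parquet")

    # A downstream element recombines the two files by position.
    extra = pq.read_table("derived_columns.parquet")
    assert base.num_rows == extra.num_rows  # rows must stay aligned
    combined = base
    for name in extra.column_names:
        combined = combined.append_column(name, extra.column(name))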

Thanks,
Bill

William F. Smith
Bioinformatician
BCforward
Lilly Biotechnology Center
10290 Campus Point Dr.
San Diego, CA 92121
[email protected]

CONFIDENTIALITY NOTICE: This email message (including all attachments) is for 
the sole use of the intended recipient(s) and may contain confidential 
information. Any unauthorized review, use, disclosure, copying or distribution 
is strictly prohibited. If you are not the intended recipient, please contact 
the sender by reply email and destroy all copies of the original message.
