Re: [topbraid-users] Guidance for using EDG for data integration?

Irene Polikoff Tue, 14 Apr 2020 11:20:13 -0700

Hi Fan Li,

I think it depends on your context and the goals for using this information.


Will you be capturing more than just simple column metadata such as the name, 
datatype and what it maps to? 
For example, you could include data profiling e.g., number of null values, min 
and max values for each data element, etc. 

If so, then these will probably be changing from one month to another. If it is 
important for users to easily see what the table/dataset looked like in each 
month, then you may consider creating new resources for each column/data 
element. If you are not very interested in readily available information about 
the dataset evolution and mainly want to know what is in the dataset today, 
then you definitely do not need new instances for data elements. Simply update 
them. The historical info could always be queried from the change history 
should you need it. Even if you are interested in making the history easily 
available to users, you could create a query for them to run against the change 
history.

In our experience, at least with databases, people are primarily interested in 
what the data source looks like today. By default, when we do a subsequent 
(second, third, etc.) import of a data source, it is applied as a delta 
against the previous import. The same tables and columns are kept if they 
continue to exist; only their profiling and sampling information is updated. 
Any new schema elements are added. Elements that no longer exist are removed. 
Some of this is done automatically and some requires the attention of 
curators: if you see one new column and one deleted column, EDG can’t know 
for sure whether it is indeed one new column and one deleted column or an 
update to the previously existing one.
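
To make the delta idea concrete, here is a minimal Python sketch of the comparison between two imports. It is an illustration only and not EDG's actual import code: the names `SchemaDelta` and `diff_columns` and the sample column names are hypothetical, and real imports also compare tables, datatypes and profiling results.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDelta:
    """Result of comparing the columns of two successive imports."""
    kept: list[str] = field(default_factory=list)
    added: list[str] = field(default_factory=list)
    removed: list[str] = field(default_factory=list)
    needs_review: bool = False


def diff_columns(previous: list[str], current: list[str]) -> SchemaDelta:
    """Compare column names from the previous and the current import."""
    prev_set, curr_set = set(previous), set(current)
    delta = SchemaDelta(
        # Columns that still exist keep their identity (and their mappings).
        kept=[c for c in current if c in prev_set],
        added=[c for c in current if c not in prev_set],
        removed=[c for c in previous if c not in curr_set],
    )
    # If columns were both added and removed, some of them may really be
    # renames of existing columns, so a curator should take a look.
    delta.needs_review = bool(delta.added) and bool(delta.removed)
    return delta


# Hypothetical monthly files for the same table.
january = ["customer_id", "gender", "zip"]
february = ["customer_id", "sex", "zip", "signup_date"]
delta = diff_columns(january, february)
print(delta)
```

In this sketch `gender` disappearing while `sex` appears is flagged for review rather than resolved automatically, which mirrors the curator step described above.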

With this approach you would not need to do the remapping of data elements 
unless the data in them changed fundamentally.

If you are only interested in knowing that a given table has specific data 
elements (columns) that map to something, and no data profiling details are of 
interest, then are you getting monthly updates because the schema may change, 
e.g., a column may have been added or deleted? In this case, I see no reason 
to create new resources for the data elements each month. 

If you need very easy access to the info about what elements were part of the 
table as of January, February, etc., you may consider creating a new instance 
of a table/dataset for each month. Otherwise, simply update the instance of the 
table you created originally. As per above, the historical info could be 
queried from the change history.


Regards,

Irene

> On Apr 14, 2020, at 11:40 AM, Fan Li <lifan0...@gmail.com> wrote:
> 
> Hi Irene, sorry for my delayed response. I have finally had an opportunity to 
> try it on the new 6.3 version. (By the way I really like the new user 
> experience which is fantastic.)
> 
> Thanks so much for pointing me in the right direction and I think the UI 
> layout you suggested is very intuitive and would work great with domain 
> experts.
> 
> One remaining question:
> Suppose we receive a new data file every month (same schema), do we create a 
> new "File Table" instance each time? If we do, can we associate the new 
> instance with the same set of "Table Columns" or we need to create new column 
> instances each time?
> 
> 
> On Wednesday, April 8, 2020 at 3:37:51 PM UTC-4, Irene Polikoff wrote:
> Hi Fan Li,
> 
> In this case, you would represent each spreadsheet as a dataset in a data 
> asset collection. Each column will become a dataset element.
> 
> For 6.4, we have added an Import feature that will create this information 
> from a spreadsheet. It will perform some profiling of data to populate 
> metadata.
> 
> 
> 
> If you are interested, maybe we can arrange a way for you to test it and 
> provide input. 
> 
> Without this feature, you could create dataset instances for each 
> spreadsheet, then use the plain spreadsheet importer to create data elements 
> for each dataset. Creating input for the importer will require manipulating 
> your data. Take the first row of your spreadsheet that lists the column names 
> and turn it into columns.
> 
> With respect to connecting the data elements from different datasets, I would 
> not use crosswalks. Crosswalks are primarily about mapping different 
> reference data, different glossaries, taxonomies, etc. You would not create a 
> separate data asset collection for each dataset. At least, I do not think you 
> would. You would most likely use a single data asset collection.
> 
> Then there is the question of whether you would map similar data elements to each 
> other or whether you would map all of them to a common business term. For 
> example, you may have different data elements (spreadsheet columns) capturing 
> gender information. It would make sense to create a business term Gender and 
> map all of them to it - in the hub and spoke type of approach. EDG has some 
> capabilities suggesting such mappings based on the available data and rules 
> about a business term. And, yes, SHACL is used for this.
> 
> Mapping one element to another makes sense if you are trying to capture 
> lineage e.g., data from one dataset is used for another dataset and your goal 
> is to capture this.
> 
> Coming back to your question on “giving domain experts a visual tool to 
> create data mapping”, I would probably organize the editor UI for data assets 
> to display data elements and drag and drop to map. I am showing an example in 
> the screenshot below. My first panel contains business terms. My second panel 
> contains data elements. I use it to select a data element to be shown on a 
> form. Then, I could drag and drop from the business term table the relevant 
> term to map the data element to. If you were doing mapping between data 
> elements, you could have in the first panel data elements from one dataset 
> and in another panel data elements from another dataset. 
> 
> 
> 
> Alternatively or additionally, you could also do batch editing. For example, 
> you could select data elements from different datasets that represent let’s 
> say gender and batch edit all of them in one step to connect them to the same 
> business term - as opposed to doing one by one mapping. There are various 
> ways to accomplish this. For example, you could use the Asset List panel. You 
> could drag and drop different data elements (using search to find them) into 
> a list in order to assemble everything you want to edit as a group. Then 
> select all of them for editing. If you are familiar with Baskets in TBC, Asset 
> Lists are similar to TBC baskets, but they are collaborative. Users can name 
> them and store them on the server to share with other users for collaborative 
> work and discussion.
> 
> Hopefully, this gives you some useful information.
> 
> Regards,
> 
> Irene
> 
>> On Apr 7, 2020, at 12:12 PM, Fan Li <lifa...@gmail.com> wrote:
>> 
>> Hi Irene,
>> 
>> Each spreadsheet represents data we received from a different customer. I 
>> would like to capture its metadata/descriptors such as column names, data 
>> types, number of records etc in EDG. As customers use slightly different 
>> terminologies, I also need to map the column names to a single schema so I 
>> can merge the data for reporting purpose.
>> 
>> 
>> On Tuesday, April 7, 2020 at 10:13:21 AM UTC-4, Irene Polikoff wrote:
>> Hi Fan Li.
>> 
>>> On Apr 7, 2020, at 7:51 AM, Fan Li <lifa...@gmail.com> wrote:
>>> 
>>> I should have added that the immediate objectives are:
>>>     • Give domain experts a visual tool to create data mapping
>>>     • Use SHACL to describe & validate the harmonized data structure
>>> 
>>> 
>>> On Tuesday, April 7, 2020 at 7:45:22 AM UTC-4, Fan Li wrote:
>>> Hi TopBraid Community,
>>> 
>>> I have a use case where I need to map data sources (spreadsheets) of 
>>> different formats into a single schema. I was wondering how I should use 
>>> EDG on the data modeling aspect of this task.
>>>     • Should I use "Data Assets" to model each data source?
>> What kind of information are you planning to import into EDG? Is it some 
>> data in spreadsheets, e.g., the actual information about, let's say, products or 
>> companies? Or do these spreadsheets contain information about data sources 
>> e.g., what datasets you have, what are the fields in each dataset, how many 
>> records in each dataset, etc.?
>> 
>> It would be useful if you could provide an example.
>>>     • Should I use "Crosswalks" for schema mapping?
>>>     • Is there a concrete example I can follow?
>>> Any guidance is appreciated!
>>> 
>>> -- 
>>> You received this message because you are subscribed to the Google Groups 
>>> "TopBraid Suite Users" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an 
>>> email to topbrai...@googlegroups.com <http://googlegroups.com/>.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/topbraid-users/80a1e323-0d3e-4a1d-bb8b-33898253242a%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/topbraid-users/80a1e323-0d3e-4a1d-bb8b-33898253242a%40googlegroups.com>.
>> 
>> 
> 
> 

