Hi Francisco, I was hoping someone “more expert” would respond, but I’ll take a shot.
It appears that what you are describing might be a hierarchical classification problem, which is an active area of data science research. Just to make sure I understand: the sub-disciplines depend on the discipline, which in turn depends on the field. If that is true, welcome to the club. I am working on a similar problem with occupational coding.

We chose to classify at the most specific level (for you, that would be the sub-discipline). That may (or may not) work for you. The problem with this solution is that it ignores the hierarchical nature of the problem. Here is an example you may see: if a paper abstract is vague, you may know it is about virology, which is a 4-digit code (2420), but be unable to classify it any further, down to the 6-digit level.

You can build a classifier that first classifies at the 2-digit level, then, depending on the result, picks a classifier for the 4-digit level, and then one for the 6-digit level, but there are two caveats. The first is that there is “drop-out” error at each level of classifier (i.e. if you are wrong at the 2-digit level, you cannot be right at the 4-digit level). You may find that the drop-out rate is higher than the error rate for a “just-use-the-most-detailed-level” classifier. The second is that each classifier should be trained on different data to obtain the best results. It is not an easy task; I was unable to build this kind of classifier that worked as well as the “just-use-the-most-detailed-level” classifier. Your results may vary.

One last thing you might try: build a classifier with ALL levels of the hierarchy. I never tried it because I can’t imagine it actually working.

Hope it helps. Report back what you do. You may succeed where I failed. I would be happy to learn about your success.

Daniel

> On Oct 10, 2017, at 11:45 PM, FRANCISCO XAVIER SUMBA TORAL
> <xavier.sumb...@ucuenca.ec> wrote:
>
> Hi,
>
> I’m having some troubles to tag some publications metadata with a taxonomy.
> The problem is the following:
>
> The UNESCO nomenclature [1] defines areas to classify research papers. Each
> of these areas have a code and are divided in three levels [2]: 1) fields
> (two-digit code), 2) disciplines (four-digit code), and 3) subdisciplines
> (six-digit code). Then I'd like to map a publication with one of these areas.
> So, when given publications' meta data, return UNESCO areas.
>
> For example, let's say I have the title, abstract, and keywords of a
> publication.
>
> Input
>
> Title: Learning representations by back-propagating errors
>
> Abstract: There have been many attempts to design self-organizing neural
> networks. The aim is to find a powerful synaptic modification rule that will
> allow an arbitrarily connected neural network to develop an internal
> structure that is appropriate for a particular task domain.....
>
> Keywords: Neural net, back-propagation, artificial intelligence.
>
> Output: If we get the areas in a bottom-up approach, we should get the
> subdisciplines and it's easy to infer the other levels. BTW, it could be more
> than one output for interdisciplinary publications or areas that have been
> combined since the taxonomy hasn't been updated. So, I might get the
> following subdiscipline based on the UNESCO taxonomy: 1203.04 Artificial
> Intelligence > Computer Sciences > Mathematics.
>
> So anyone can help me with some insights to implement a taxonomy matcher? or
> some related work already done?
>
> Cheers.
>
> [1] https://en.wikipedia.org/wiki/UNESCO_nomenclature
> [2] http://unesdoc.unesco.org/images/0008/000829/082946eb.pdf
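
P.S. In case it helps to make the two ideas in my reply concrete, here is a minimal sketch in Python. It shows (a) how the field and discipline fall out of a six-digit code once you classify at the most detailed level, and (b) why the “drop-out” error in a 2-digit, then 4-digit, then 6-digit cascade compounds. The function names `ancestors` and `cascade_accuracy` are my own, and the accuracy figures are made-up numbers for illustration only, not results from my project or from any real classifier.

```python
from math import prod


def ancestors(code):
    """Split a six-digit UNESCO-style code into (field, discipline, subdiscipline).

    Accepts either "1203.04" or "120304". The prefix rule (2, 4, 6 digits)
    follows the level description in the quoted message.
    """
    digits = code.replace(".", "")
    if len(digits) != 6 or not digits.isdigit():
        raise ValueError(f"expected a six-digit code, got {code!r}")
    return digits[:2], digits[:4], digits


def cascade_accuracy(per_level_accuracy):
    """Overall accuracy of a top-down cascade.

    Each entry is the accuracy of one level GIVEN that every level above it
    was right. A mistake at any level cannot be recovered further down, so
    the overall accuracy is the product of the per-level values.
    """
    return prod(per_level_accuracy)


# Bottom-up: predict the subdiscipline, then read off the upper levels.
field, discipline, subdiscipline = ancestors("1203.04")
print(field, discipline, subdiscipline)

# Top-down: hypothetical per-level accuracies (assumed values, not measured).
hypothetical_levels = [0.90, 0.85, 0.80]
print(f"cascade accuracy: {cascade_accuracy(hypothetical_levels):.3f}")
```

With these assumed numbers the cascade ends up at about 0.612 overall, even though no single level is below 0.80. That is the comparison to make against your own flat, most-detailed-level classifier before investing in the cascade.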