Rerendering Semantic Ontologies: Automatic Extensions to UMLS through Corpus Analytics
Contents
Abstract
Introduction
Semantic Rerendering Methodology
  Linguistic Rerendering
  Database Rerendering
Methodology
  Seed Ontology
  Corpus preprocessing with UMLS types
  Inducing candidate subtypes
Results
  NP analysis-based subtypes
  NP modifier-based extension (second level)
  Corpus-based identification of the instances of induced semantic categories
Evaluation
Abstract
Utility and deficiencies of existing ontology resources
Semantic rerendering: a technique for increasing the semantic type coverage of a specific ontology
UMLS (the Unified Medical Language System): the existing type system
Medline: the existing corpus, an online database of 11 million citations and abstracts from health and medical journals and other news sources
Ontology + Corpus => new Ontology
Notes: The "semantic classification" is improved by analyzing a large corpus from a specific domain. The material is aimed at tools for interpreting biomedical terminology. The corpus used is Medline.
UMLS: this ontology has many deficiencies.
Medline: a corpus-based technique; we choose this corpus.
Introduction
NLP requires semantic typing (an ontology).
UMLS is inadequate for NLP: under the semantic tag “Amino Acid, Peptide, or Protein” (henceforth AAPP) there are 180,998 entries, yet dozens of functional subtypes that biologists routinely distinguish are not represented in the UMLS.
Introduction: example text
“For separation of nonpolar compounds, the prerun can be performed with hexane... The selection of this solvent might be considered ..”
The UMLS Metathesaurus types hexane as either ‘Organic Chemical’ or ‘Hazardous or Poisonous Substance’. Neither type marks it as a solvent, so “this solvent” cannot be linked back to hexane.
Extracting information from the corpus can improve the ontology.
Introduction: example text (inhibitor)
[p21 inhibits the regulation of ...] ... [This inhibitor binds to ...]
The type Inhibitor does not exist in the UMLS, which is a problem for resolving the sortal anaphor “this inhibitor” (= p21).
[A phosphorylates B.] ... [The phosphorylation of B ...]
Here too, the two expressions are of different types in the UMLS.
Introduction
1. Corpus analysis of compound nominal phrases that express unique functional behavior of the compound head.
2. Identification of functionally defined subtypes derived from bio-relation parsing and extraction from the corpus.
The results of rerendering are evaluated for correctness against the original type system.
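Semantic Rerendering
Many NLP tasks in the service of information extraction can benefit from more accurate semantic typing of the syntactic constituents in the text.
Two strategies:
Linguistic Rerendering: syntactic and semantic analysis of NP structures in the text.
Database Rerendering: analysis of “ad hoc abstractions” from a database of relations automatically derived from the corpus.
Notes: Rerendering adapts an existing type system to a different application. It takes an existing type system (UMLS) and a corpus (Medline) as input and improves the type system. It rests on two strategies.
1. Linguistic rerendering. Syntactic and semantic analysis of noun compound structure; a strategy that exploits grammatical and linguistic properties. Question: what exactly does NP stand for here, noun compound? (NP normally abbreviates noun phrase.) Candidate subtypes are determined from noun groups. The targets are functional categories that are not explicit in the type system but are of interest to biologists: phosphorylators (phosphate derivative? or phosphate group?), receptors, inhibitors. A rich set of subtypes is desirable. If every substance mentioned can be given semantic annotations (properties, functionally defined categories, and so on), more information can be extracted and tied back to the text.
2. Database rerendering. Analyzes abstractions in particular, using relations generated automatically from the corpus. A strategy that works from data already obtained from the existing corpus.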
Linguistic Rerendering: subtypes
We use the syntax of noun groups to identify candidate subtypes of an existing UMLS type.
Functional categories are of interest to biologists but are not explicitly represented in the type system, e.g. phosphorylators, receptors, and inhibitors.
Example subtypes of Receptor:
  CB(2) receptor
  cannabinoid receptor
  cell receptor
  D1 dopamine receptor
  epidermal growth factor receptor
  functional GABAB receptor
  gastrin receptor
  orphan receptor
  orphan nuclear receptor
  major fibronectin receptor
  mammalian skeletal muscle acetylcholine receptor
  normal receptor
  PTHrP receptor
  protein-coupled receptor
  ryanodine receptor
Recent research on extracting hyponym and other relations from corpora: Hearst, 1992; Pustejovsky et al., 1997; Campbell & Johnson, 1999; Mani, 2002.
Linguistic Rerendering: RHHR (righthand head rule, cf. Pustejovsky et al. (1997))
fight[V] + sport[N] = fightsport[N] (a kind of sport)
wheel[N] + chair[N] = wheel chair[N] (a kind of chair)
"chair" is the headword, and can be a subtype of furniture.
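The righthand head rule can be sketched in a few lines. This is a minimal illustration, assuming whitespace-tokenized English noun groups; the function names `rhhr_head` and `split_head_modifiers` are hypothetical and not taken from the paper.

```python
def rhhr_head(noun_group: str) -> str:
    """Return the rightmost token of a noun group as its head (RHHR)."""
    tokens = noun_group.lower().split()
    if not tokens:
        raise ValueError("empty noun group")
    return tokens[-1]


def split_head_modifiers(noun_group: str):
    """Split a noun group into (head, list of modifiers to its left)."""
    tokens = noun_group.lower().split()
    return rhhr_head(noun_group), tokens[:-1]


if __name__ == "__main__":
    print(split_head_modifiers("wheel chair"))
    print(split_head_modifiers("epidermal growth factor receptor"))
```

The head gives the first-level subtype candidate ("chair", "receptor"); the modifiers are the material that a second-level extension would work on.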
Linguistic Rerendering
"red wheel chair" occurs in the corpus:
  it is a kind of "furniture", and it is interesting enough;
  the headword "chair" will be a subtype of "furniture";
  "wheel", as a modifier, can form a subtype of "chair";
  "red" should be filtered out.
Notes: In step 4, "chair" can also be seen as a functional category; this corresponds to the third condition of step 4. The word "red" has to be filtered out by one of several methods. One signal is that "red" is absent when "wheel chair" occurs in other forms. Another is that if "red" also appears as a modifier of nouns other than "wheel chair", it can be treated as something to filter out. (Based on the actual content of the paper.)
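One of the filtering signals in the note (a modifier such as "red" attaches to many unrelated heads) can be sketched as follows. This is only an illustration under assumptions: the function name `generic_modifiers`, the `max_heads` cutoff, and the sample phrases are all invented here, and the paper may filter differently.

```python
from collections import defaultdict


def generic_modifiers(noun_groups, max_heads=1):
    """Return modifiers that occur with more than `max_heads` distinct heads.

    Such modifiers (e.g. "red") are treated as too generic to define a subtype.
    """
    heads_by_modifier = defaultdict(set)
    for group in noun_groups:
        tokens = group.lower().split()
        if len(tokens) < 2:
            continue  # no modifier present
        head = tokens[-1]
        for modifier in tokens[:-1]:
            heads_by_modifier[modifier].add(head)
    return {m for m, heads in heads_by_modifier.items() if len(heads) > max_heads}


if __name__ == "__main__":
    # Made-up sample phrases, for illustration only.
    sample = ["red wheel chair", "wheel chair", "electric wheel chair",
              "red car", "red apple"]
    print(generic_modifiers(sample))
```

With this sample, "red" modifies three different heads and is flagged, while "wheel" and "electric" only ever modify "chair" and are kept.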
Linguistic Rerendering
These instances ("red wheel chair") and their type bindings ("wheel chair" as a kind of "chair") can then be identified from the corpus using a number of standard methodologies developed in the field for the expansion of ontology coverage (Hearst, 1992; Campbell & Johnson, 1999; Mani, 2002).
Notes: Standard methods? One of these "standard methodologies" is the table below: templates!
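As an illustration of a syntactic pattern template, here is a minimal sketch of the well-known "X such as A, B and C" pattern in the style of Hearst (1992). It is not necessarily one of the templates in the paper's table; the regular expressions, the function name `such_as_pairs`, and the sample sentence are assumptions, and the sketch only handles single-word items on raw text rather than shallow-parsed chunks.

```python
import re

# "<hypernym> such as <item>, <item> and <item>" with single-word items.
SUCH_AS = re.compile(
    r"(?P<hypernym>\w+) such as "
    r"(?P<items>\w+(?:, (?!(?:and|or) )\w+)*(?:,? (?:and|or) \w+)?)"
)
# Separators between listed items: ", ", ", and ", " and ", " or ".
SEPARATOR = re.compile(r",\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def such_as_pairs(sentence):
    """Return (instance, hypernym) pairs found by the 'such as' template."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hypernym")
        for item in SEPARATOR.split(match.group("items")):
            if item:
                pairs.append((item, hypernym))
    return pairs


if __name__ == "__main__":
    # Invented sentence, for illustration only.
    print(such_as_pairs("solvents such as hexane and dichloromethane were used"))
```

Each extracted pair is a candidate instance together with the category it would be bound to.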
Database Rerendering
For relation R and each subtype N' of T, associate N' with X if Sim(N, R) > s
e.g. Sim("kinase", "phosphorylate"), Sim("inhibitor", "inhibit")
Pustejovsky et al. (2002) and Castaño et al. (2002).
Notes: It is not clear what is initially taken as R; it probably comes from UMLS. Once R has been chosen that way, deciding whether it is related to N is left to the Sim function.
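The slide does not define Sim, so the following is only a sketch under a stated assumption: Sim(N, R) is estimated from a relation database as the fraction of occurrences of head N (as first argument) that occur with relation R. The function names, the tuple layout of `relation_db`, the threshold value, and the toy counts are all hypothetical.

```python
def sim(head, relation, relation_db):
    """Assumed Sim(N, R): share of `head` occurrences that occur with `relation`.

    `relation_db` is a list of (relation, first_argument_head) tuples.
    """
    total = sum(1 for _, h in relation_db if h == head)
    if total == 0:
        return 0.0
    hits = sum(1 for rel, h in relation_db if h == head and rel == relation)
    return hits / total


def associate(subtypes, relation, relation_db, s=0.5):
    """Keep the subtypes whose similarity to the relation exceeds threshold s."""
    return [n for n in subtypes if sim(n, relation, relation_db) > s]


if __name__ == "__main__":
    # Toy relation database with invented counts, for illustration only.
    db = ([("phosphorylate", "kinase")] * 3 + [("bind", "kinase")]
          + [("inhibit", "inhibitor")] * 2
          + [("bind", "receptor")] * 2 + [("phosphorylate", "receptor")])
    print(associate(["kinase", "inhibitor", "receptor"], "phosphorylate", db))
```

A distributional measure is used here rather than string similarity because a pair like "kinase" and "phosphorylate" shares no surface form.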
Methodology - Seed Ontology
UMLS is the seed ontology.
UMLS Metathesaurus: over 1.5 million string mappings, covering single lexical items and complex nominal phrases
UMLS Semantic Network: 134 semantic types, hierarchically arranged via the 'isa' relation and interlinked by a set of secondary non-hierarchical relations
SPECIALIST Lexicon (UMLS Knowledge Source, 2001)
Notes: Looking at R, X and Y, the structure of UMLS can be roughly inferred. There is a structured arrangement into an R list, an N list and so on, with entries listed flat beneath each. So what the is-a relation divides "structurally" is R and N, and within the R list there is no structure. This may be too extreme a reading.
Methodology - Seed Ontology
UMLS is ambiguous, but this ambiguity essentially resolves itself.
isozyme, which the seed UMLS types as either Enzyme or AAPP, will only be identified as a good candidate subtype for Enzyme, because its frequency under Enzyme is higher than under AAPP.
Notes: Glossary: isozyme = isoenzyme; enzyme; AAPP = protein? However, a frequency-based judgment like this cannot be assumed to be right every time.
Methodology - Corpus preprocessing with UMLS types
Medline: around 40,000 items (relatively small), which were tokenized, stemmed, tagged, and shallow-parsed.
5 steps for assigning a type T to each NP (nominal compound) found:
1. If a semantic typing of the whole NP is possible, it is assigned.
2. Otherwise, try the headword (RHHR).
3. If the NP has the OF-attachment form <NP-1> of <NP-2>, try NP-1.
4. If these all fail, test for a match with morphological heuristics recognizing semantically vacant categories: NUMERIC, ABBREVIATION, SINGLE CAPITAL LETTER.
5. Strip a group of suffixes and prefixes and perform the lookup on the remaining stem.
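The five-step cascade can be sketched as a fallback chain. This is a minimal illustration, not the authors' implementation: the toy lexicon stands in for UMLS Metathesaurus lookups, the affix lists and regular expressions are assumptions, and for an of-phrase the sketch recurses on NP-1 directly, on the assumption that a shallow parser would deliver NP-1 and NP-2 as separate chunks.

```python
import re

# Hypothetical toy lexicon standing in for UMLS Metathesaurus lookups.
TOY_LEXICON = {
    "epidermal growth factor receptor": "Receptor",
    "receptor": "Receptor",
    "kinase": "Enzyme",
    "hexane": "Organic Chemical",
}
# Hypothetical affix lists; the slides do not say which affixes are stripped.
PREFIXES = ("anti", "pre")
SUFFIXES = ("es", "s")


def lookup(phrase):
    return TOY_LEXICON.get(phrase.lower())


def vacant_category(token):
    """Step 4: heuristics for semantically vacant categories."""
    if re.fullmatch(r"\d+(?:\.\d+)?", token):
        return "NUMERIC"
    if re.fullmatch(r"[A-Z]", token):
        return "SINGLE CAPITAL LETTER"
    if re.fullmatch(r"[A-Z]{2,5}", token):
        return "ABBREVIATION"
    return None


def candidate_stems(token):
    """Step 5: yield stems with a known prefix and/or suffix removed."""
    word = token.lower()
    bases = [word] + [word[len(p):] for p in PREFIXES
                      if word.startswith(p) and len(word) > len(p)]
    for base in bases:
        for suffix in SUFFIXES:
            if base.endswith(suffix) and len(base) > len(suffix):
                yield base[:-len(suffix)]
        if base != word:
            yield base


def type_np(np):
    """Assign a type to a noun phrase, or None if every step fails."""
    np = np.strip()
    if not np:
        return None
    found = lookup(np)                          # step 1: whole phrase
    if found:
        return found
    if " of " in np:                            # step 3: <NP-1> of <NP-2>
        return type_np(np.split(" of ", 1)[0])
    head = np.split()[-1]
    found = lookup(head) or vacant_category(head)   # steps 2 and 4
    if found:
        return found
    for stem in candidate_stems(head):          # step 5: affix stripping
        found = lookup(stem)
        if found:
            return found
    return None
```

For example, `type_np("orphan nuclear receptor")` falls through step 1 and is typed by its head in step 2, while `type_np("kinases of yeast")` reaches NP-1 in step 3 and is typed only after suffix stripping in step 5.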
Methodology - Inducing candidate subtypes
A headword was considered a candidate subtype of type T if it occurred in more than 1% of all nominal chunks tagged as T. This is step 4 of Linguistic Rerendering.
The candidate subtypes for the second (NP modifier-based) level of UMLS extension were identified using a combination of template and frequency-based filtering of noun phrases and the LCS (longest common subsequence) algorithm. This is step 5 of Linguistic Rerendering.
Identification of sample instances for the induced types was performed over shallow-parsed text using syntactic pattern templates. This is the table of Linguistic Rerendering.
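The two ingredients named above can be sketched as follows: the 1% headword threshold, and a standard dynamic-programming longest common subsequence over tokens. How the paper combines LCS with template and frequency filtering is not stated in the slides, so the sketch only shows the building blocks; the function names and the toy counts are assumptions.

```python
from collections import Counter


def candidate_heads(nps_tagged_t, threshold=0.01):
    """Return {head: share} for heads in more than `threshold` of chunks tagged T."""
    heads = Counter(np.lower().split()[-1] for np in nps_tagged_t if np.strip())
    total = sum(heads.values())
    return {h: c / total for h, c in heads.items() if c / total > threshold}


def lcs(a, b):
    """Longest common subsequence of two token lists (classic DP table)."""
    m, n = len(a), len(b)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back through the table to recover one longest subsequence.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


if __name__ == "__main__":
    # Toy counts, invented for illustration.
    chunks = ["x receptor"] * 150 + ["integrin"] * 49 + ["odd thing"]
    print(sorted(candidate_heads(chunks)))
    print(lcs("D1 dopamine receptor".split(), "D2 dopamine receptor".split()))
```

On token sequences, the LCS of two receptor phrases isolates the material they share (here "dopamine receptor"), which is one plausible way to surface a recurring modifier-plus-head pattern.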
Results - NP analysis-based subtypes
Semantic typing over our sample set of Medline data produced type bindings for over 1 million noun phrases.
Supertype: from UMLS
Subtype: from headword (step 4) frequency in Medline
Results - NP modifier-based extension (2-level)
NPs headed by the word “receptor” comprise 87% of all NPs tagged as Receptor in our test corpus:
  Receptor (all tagged NPs)   2820
  headword integrin             91
  headword receptor           2444
Enzyme has the same problem with the headword "enzyme". We need type filtering.
Notes: The second level corresponds to step 5, adding subtypes based on modifiers. Section 2.1 says that the repetition of the category name can occur in step 4, but Section 4.2 says it shows up at the second level (step 5). It seems to occur regardless of level.
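The 87% figure follows directly from the counts on the slide, and one possible reading of "type filtering" is to drop heads that merely repeat the type name. The sketch below shows both; the function names are hypothetical, and the filtering rule is an assumption rather than the paper's stated procedure.

```python
def head_share(head_count, type_total):
    """Share of NPs tagged with a type that carry a given headword."""
    return head_count / type_total


def drop_type_name_heads(head_counts, type_name):
    """Assumed 'type filtering': discard heads identical to the type name."""
    return {h: c for h, c in head_counts.items() if h.lower() != type_name.lower()}


if __name__ == "__main__":
    counts = {"receptor": 2444, "integrin": 91}   # counts from the slide
    print(f"{head_share(counts['receptor'], 2820):.1%}")
    print(drop_type_name_heads(counts, "Receptor"))
```

Heads removed this way carry no new first-level information, which is why the receptor NPs are better split by their modifiers at the second level.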
Corpus-based identification of the instances of induced semantic categories
For functionally defined semantic types, such as “Chemical Viewed Functionally” or “Indicator, Reagent, or Diagnostic Aid”, corpus-based derivation of instances for the induced subcategories is clearly much more feasible. Consider the first level extension types for the categories below:
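Corpus-based identification of the instances of induced semantic categories
Notes: The functional classification includes "inhibitor", which shows that the resulting categories are function-oriented. Personally I am not very impressed, perhaps because I do not know biotechnology well.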
Corpus-based identification of the instances of induced semantic categories Instances derived with the definitional construction template for subtypes of receptor
Evaluation In order to do an earnest evaluation of performance of the rerendering algorithm, we would need to run it on a much larger corpus. This would allow for better candidate choices for the portions of the procedure that have been plagued by sparsity (e.g., in NP modifier-based candidate subtype selection). It would increase the coverage in terms of instances for which the type bindings are produced in the new type system.
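Usability in natural language applications
Notes: "dichloromethane" is written 디클로로메탄 in Korean, although the English pronunciation differs somewhat. A Google search readily shows that it is a solvent, but this does not appear in UMLS. Literal translation of this part of the paper: the rerendered ontology allows dichloromethane to be typed as a solvent.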
Evaluation against existing ontologies
We performed some test evaluations of the second-level extension subtypes against the Gene Ontology and observed significant overlap in some categories. For example, of the 388 second-level extension subtype candidates for receptor, 12% were identified as concept names in the Gene Ontology. Encouraging.
In the future, better automated methods for the evaluation of rerendering results against the existing ontologies must be developed.
The utility and usefulness of the rerendering algorithm must be evaluated vis-à-vis achieving improvement in precision and recall for client NLP applications.
Notes: The 12% figure from the comparison with GO does not seem to say much about performance. Is it not simply expected that there is a lot of overlap with gene terminology? Then again, expected means normal, and perhaps it is impressive that something generated automatically behaves normally. The future-work part shows that, in the end, no proper evaluation has been done yet. At this point one suspects that the paper was rather lucky to be published.
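The overlap measure behind the 12% figure can be sketched as a simple set comparison. The slides do not say how names were matched, so the case-insensitive exact match below is an assumption, as are the function name and the toy reference list (which is a made-up stand-in, not real Gene Ontology content).

```python
def overlap_fraction(candidates, reference_names):
    """Fraction of candidate subtype names that also occur in a reference list."""
    cand = {c.lower() for c in candidates}
    ref = {r.lower() for r in reference_names}
    if not cand:
        return 0.0
    return len(cand & ref) / len(cand)


if __name__ == "__main__":
    # Candidate names from the receptor examples; the reference set is invented.
    candidates = ["cannabinoid receptor", "ryanodine receptor",
                  "normal receptor", "cell receptor"]
    reference = ["Cannabinoid Receptor"]
    print(overlap_fraction(candidates, reference))
    # 12% of 388 candidates corresponds to roughly this many matches:
    print(round(0.12 * 388))
```

Note that this measures agreement with another resource only; as the slide says, it is not a precision or recall figure for a client NLP application.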