4. Querying 인공지능연구실.

Slides:

Advertisements

Similar presentations

이진 나무 구조 강윤섭 2008년 5월 23일.

Advertisements

컴퓨터와 인터넷.

DB 프로그래밍 학기.

DB 프로그래밍 학기.

[별첨] 특허 DB 구축 및 토픽 모델링 수행 과정 Flowchart, File List

한국통신 멀티미디어연구소 김 영 환 인터넷 정보검색 제 10회 한글 및 한국어 정보처리 학술대회 인간과 기계와 언어 한국통신 멀티미디어연구소 김 영 환

쉽게 배우는 알고리즘 6장. 해시 테이블Hash Table.

테이블 : 데이터베이스를 구성하는 요소로 같은 성격에 정보의 집합체. 레코드 : 하나의 정보를 가지고 있는 컬럼의 집합체

NLP Lab. 세미나 발표자:이주호 2007년 7월 18일 수요일

Word2Vec Tutorial 박 영택 숭실대학교.

4장. 웹로직 서버상에서의 JDBC와 JTA의 운용

UNIT 07 Memory Map 로봇 SW 교육원 조용수.

6장. 물리적 데이터베이스 설계 물리적 데이터베이스 설계

해싱(hashing) Ref. C로 쉽게 풀어쓴 자료구조, 생능출판사, 2005.

11장. 포인터 01_ 포인터의 기본 02_ 포인터와 Const.

정보기술을 이용한 단백질 서열 분석 (IT-based Protein Sequence Analysis)

Information Retrieval (Chapter 5: 질의연산)

17강. 데이터 베이스 - I 데이터 베이스의 개요 Oracle 설치 기본적인 SQL문 익히기

osp.chungbuk.ac.kr/2012년 강의자료

자료구조: CHAP 4 리스트 (3) 순천향대학교 컴퓨터공학과 하 상 호.

Chapter 10. 파일 시스템 인터페이스(File System Interface)

파일 시스템 인터페이스(File System Interface)

11.텍스트를 위한 화일.

Chap 6.Assembler 유건우.

인터넷응용프로그래밍 JavaScript(Intro).

2장 모델링 2.1 소개 2.2 정보 검색 모델의 분류체계 2.3 검색 : 축적과 여과 2.4 정보 검색 모델의 형식 특성

Linux/UNIX Programming

UNIT 07 Memory Map 로봇 SW 교육원 조용수.

제 8 장 객체지향 데이타베이스와 데이타베이스의 새로운 응용 분야

Java의 정석 제 5 장 배 열 Java 정석 남궁성 강의 의

정보 검색 연구 내용 및 연구 방향 충남대학교 정보통신공학부 맹 성 현 데이타베이스연구회 2000년도 춘계 튜토리얼

Linux/UNIX Programming

Chapter6 : JVM과 메모리 6.1 JVM의 구조와 메모리 모델 6.2 프로그램 실행과 메모리 6.3 객체생성과 메모리

컴퓨터 프로그래밍 기초 - 10th : 포인터 및 구조체 -

Managing Gigabytes 3. Indexing.

7장. 해시 테이블Hash Table.

CHAP 11 : 해싱.

Chapter 12 Memory Organization

제 1 장. 자료구조와 알고리즘.

시스템 분석 및 설계 글로컬 IT 학과 김정기.

CHAPTER 04 파일 설계(FiLE Design).

C언어 응용 제 15 주 검색.

Linux/UNIX Programming

Linux/UNIX Programming

객체기반 SW설계 팀활동지 4.

Canary value 스택 가드(Stack Guard).

Signature, Strong Typing

Signature, Strong Typing

2nd day Indexing and Slicing

알고리즘 알고리즘이란 무엇인가?.

데이터 동적 할당 Collection class.

12주 실습강의 학기, 소프트웨어 설계 및 실험(Ⅰ).

고급 정보 검색 1. 개 요.

Signature, Strong Typing

문서 클러스터링 일본언어문화학과 서동진.

Word Embedding.

Support Vector Machine

Summary of Pointers and Arrays

MATLAB Homework#6 Equalizer 기초

Numerical Analysis Programming using NRs

Bug Localization Based on Code Change Histories and Bug Reports

데이터 베이스의 내부 구조.

8장 선택 논리 II 1. 논리연산자 1.1 논리연산자 : AND (&&) 1.2 논리연산자 : OR (||)

제 4 장 Record.

NACST progress report 신수용.

교착 상태 해결 : 교착 상태 탐지 교착 상태 탐지(Deadlock Detection)

Linux/UNIX Programming

Text Clustering G 조한얼.

가상 기억장치 (Virtual Memory)

Presentation transcript:

4. Querying 인공지능연구실

차례 Accessing the lexicon Partially specified query terms Boolean query processing Ranking and information retrieval Evaluating retrieval effectiveness Implementation of the cosine measure Interactive retrieval

Preview(1) Boolean query and Ranked query Boolean query AND, OR, NOT과 같은 접속어로 연결된 term들의 list로 구성 high recall vs. high precision query에서의 작은 차이가 매우 다른 결과를 낼 수 있음 query를 만드는 데는 통찰력과 언어적인 기술이 요구됨

Preview(2) Ranked query query와 각 문서와의 similarity를 측정하기 위한 heuristic을 사용 측정한 numeric indicator를 바탕으로 r most closely matching document가 결과로 나오게 됨 ( r은 10이거나 100) low precision + high recall high precision + low recall

Accessing the lexicon lexicon for an inverted file index에는 term 뿐만 아니라 보조 정보도 저장 It : address in the inverted file of the corresponding list of document numbers ft : the number of documents containing the term lexicon들이 어떻게 저장되는 방식 중요

Access structures Array of records (Figure 4.1) 20byte string + 4byte address field + 4byte ft Array of string pointers (Figure 4.2) One long contiguous string + an array of 4-byte character pointer one word in four indexed (Figure 4.3) blocking을 이용하여 string pointer를 줄임 blocking은 searching process를 더 복잡하게 만듦

Front coding(1) 정렬된 list에서의 연속된 단어들은 같은 prefix를 공유할 가능성이 많다는 점 이용 complete front coding p.122 Table 4.1 참조 앞 단어와 같은 prefix characters의 개수 + 나머지 character들의 개수 + 남은 문자열 더 큰 lexicon에서는 더 많은 메모리를 절약할 수 있다. Binary search가 불가능

Font coding(2) Partial “3-in-4” front coding p.122 Table 4.1 참조 block pointer가 가리키는 단어들은 완전한 문자열과 길이를 유지 block내 마지막 단어는 suffix length가 생략됨 binary search 가능

Minimal perfect hashing(1) hash function : mechanism for mapping a set L of n keys xj into a set of integer values h(xj) in the range 1≤ h(xj)≤m, with duplicates allowed h(x) = x mod m (m > n/a, a : loading(record와 사용가능한 주소와의 비율)) a값이 작을 수록 두 key가 같은 hash value에서 충돌할 가능성이 낮아짐

Minimal perfect hashing(2) Perfect hash function hash function + i = j 일 때만 h(i) = h(j) no collision minimal perfect hash function(MPHF) perfect hash function m = n each of n keys hash to a unique integer between 1 and n a = 1.0 order-preserving minimal perfect hash function minimal perfect hash function + xi < xj then h(xi) < h(xj)

Minimal perfect hashing(3)

Design of a minimal perfect hash function(1) Probability or the birthday paradox n items are to be hashed into m slots m = 365 and n = 22 → 0.524, n = 23 → 0.493 Checking for acyclicity and assigning a mapping figure 4.5 참조

Design of a minimal perfect hash function(2) 14 13 1 12 5 11 2 7 11 10 6 2 4 3 9 10 3 8 4 9 1 5 6 8 7

Design of a minimal perfect hash function(3) Generating a perfect hash function m값을 정한다 w1[i]와 w2[i]를 랜덤하게 정한다 그래프 G=(V,E)를 생성한다(V={1,…,m}, E={(h1(t), h2(t))}t∈L}) figure 4.5를 이용해서 mapping g를 계산 labeling algotirhm이 실패하면 step 2로 돌아간다 w1, w2, g를 return

Disk-based lexicon storage p.131 Figure 4.7 참조 lexicon을 저장하는데 primary memory를 줄일 수 있는 방법(put it on disk) primary memory에는 각 term에 해당하는 disk block을 구별하기 위한 정보 저장 정보를 찾을 때는 in-memory index에서 block number를 찾은 다음 그 block을 buffer로 읽어 와서 계속 찾음 간단하고 작은 양의 primary memory 필요 비교적 작은 양의 lexicon을 작은 workstation이나 personal computer에 저장할 때 효과적

Partially specified query terms(1) 찾고자 하는 term이 wildcard character, *,를 포함하고 있을 때 Brute force string matching 전체 lexicon을 main memory에 있다면 fast pattern-matching algorithm을 이용하여 모든 term들과 pattern을 비교할 수 있음 Indexing using n-grams p.133 Table 4.3 참조 query term을 n-grams으로 분해 false match가 일어날 수 있으므로 결과를 다시 pattern matcher를 이용하여 체크해야함 ( ex : lab*r laaber, lavacaber)

Partially specified query terms(2) Rotated lexicons p.135 Table 4.4 참조 indexing pointer를 각각의 lexicon character에 유지 속도는 빠르나 메모리 사용이 많다. wildcard character가 앞뒤에 다 있을 때는 하나로 통일 multiple wildcard character query에서 가장 긴 character가 match되는 후보 set을 만듦 each specified component를 차례로 검사 pattern matcher를 이용하여 check

Boolean query processing(1) 간단하고, AND, OR, NOT과 같은 접속어로 연결된 term들의 list로 구성 inverted file index를 이용하여 처리할 때 비교적 수월함 Conjunctive queries 모든 term이 AND operation으로 연결됨 처리과정 각각의 term들을 기본형으로 만들고 lexicon에서 찾는다 term들을 frequency라 증가하는 순서로 정렬 least frequent term에 대한 inverted file entry를 메모리로 가져온다 다른 term들의 entry들을 위의 후보 set에 대해 처리

Boolean query processing(2) Term processing order 후보 set을 만들 때 가장 least frequent term을 선택하는 이유 query processing에 요구되는 메모리 공간을 양을 줄이기 위해서 frequency 순서대로 처리하면 더 빠르기 때문에 (look-up operation과 merge operation에 드는 시간 비교)  |C| = 60, ft = 60,000 : 압축이 안 되었다면!!! 60,060 vs. 60 compressed inverted file을 이용하면 공간은 많이 절약되지만 conjunctive query processing 중에 시간이 더 필요함(압축을 풀고 merge하는데 드는 시간)

Boolean query processing(3) Random access and fast lookup 빠른 searching을 지원하기 위해서는 inverted file entry내에서 random access가 제공되어야 함 세 가지 issue storage mechanism used for the index suitable value for bt, the blocking constant for term t trade-off of time for space Nonconjunctive queries Boolean query expression이 복잡해질 때는 informal or ranked query를 이용

Ranking and Information Retrieval Boolean Query Data에 대한 정보가 확실히 알려져 있는 경우 Exact search가 가능 Commercial DB, Bibliographic system Coordinate matching Query term의 수가 더 많은 문서가 더 유사하다 한 개의 term이라도 있으면 관련된 문서로 판단 Ranking Query와 document간의 유사도를 측정 유사도가 큰 문헌 순서대로 정렬하여 display

Inner Product Similarity Vector의 내적을 이용한 유사도 계산 Query를 document로 취급하여 계산 문제점 Term frequency의 고려가 없다 Scarcity의 고려가 없다 대책 Vector 값을 binary에서 정수로 불용어 제거, idf 사용

Vector Space Models Query와 Document를 벡터 공간 상에서 한 점으로 취급 벡터 공간은 문서 collection에 나타나는 색인어에 의해 결정 벡터 공간 모델에서 두 문서간의 유사도는 두 벡터 사이의 각의 cosine값을 이용 각이 작을수록 유사도가 높다는 아이디어에서 출발 문서의 길이가 긴 경우 유사도가 커질 가능성을 배제

Weight Vector TF만으로는 문제점 발생 Weight : idf를 사용 TF*IDF 사용 문서집단의 크기, text의 길이, 단어의 사용빈도를 고려하지 않음 Weight : idf를 사용 TF*IDF 사용 실제 Query에서는 TF가 의미가 없음

Cosine Measure

Evaluating retrieval effectiveness 얼마나 많은 적합문서가 검색되었나? Recall(재현율) 전체 적합 문서 중 검색된 적합 문서 수 Precision(정확도) 검색된 문서 중 적합한 문서 수 TREC IR System의 평가자료

Inverted file의 구성 Inverted file 을 만들 때 frequency를 삽입 <apple;3;[1,2,5]> <apple;3;[(1,3),(2,1),(5,1)]> 문서 set이 작을 경우 : unary code 평균적으로 감마코드가 이상적

Implementation of the cosine measure Issue : Memory 냐 Disk 냐 ? 전체 data에 대해 모두 처리할 것이냐, 일부만을 처리할 것이냐 ? N-best algorithm

Interactive retrieval 왜 필요한가 ? 개념은 있는데 단어가 생각나지 않을 때 개념과 뜻이 통하지 않는 단어를 사용 개념을 나타내지 못하는 적은 수의 단어를 사용할 때 검색결과가 너무 적을 때 어떻게 해결할 수 있는가 ? 사용자 질의어와 의미가 비슷한 말들을 추가하는 방법 사용자가 좋은 질의를 할 수 있도록 질의어를 추천해 주는 시스템 검색결과를 보고 사용자가 정보를 더 추가하면 시스템이 이를 처리해 사용자의 의도와 맞는 결과를 다시 보내 주는 방법

Relevance Feedback

Probabilistic models 개념 적합문서와 부적합문서 내의 색인어 출현 정보에 의존 특정 query에 대해 각 문서가 적합할 확률과 부적합할 확률을 계산하여 적합할 확률이 큰 문서를 검색 적합문서와 부적합문서 내의 색인어 출현 정보에 의존 Bayes’ theorem에 기초 이전에 적합성 정보가 준비되어 있어야 한다.