N-gram Language Model

Language Model

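A language model is a model that assigns probabilities to word sequences. (A word sequence is a sequence made up of words, that is, a sentence.) Two approaches are mainly used to assign these probabilities: ① predicting the next word given the preceding words, and ② predicting the center word from the surrounding words on both sides.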

This post covers approach ①: a language model that predicts the next word ( $w_n$ ) given the preceding words ( $w_1, w_2, …, w_{n-1}$ ). Expressed as a probability, this is:

$P (w_n|w_1, w_2, w_3,..., w_{n-1})$

The probability of a word sequence $W$ follows from the chain rule of probability:

$P(W) = P(w_1, w_2, w_3,..., w_n) = \prod_{i=1}^{n} P(w_i|w_1,...,w_{i-1})$

For example, the probability of the sentence “An adorable little boy is spreading smiles”, written $P(An\ adorable\ little\ boy\ is\ spreading\ smiles)$, is:

$P(An\ adorable\ little\ boy\ is\ spreading\ smiles)$

$= P(An)\times P(adorable|An)\times P(little|An\ adorable)$

$\times P(boy|An\ adorable\ little)\times P(is|An\ adorable\ little\ boy)$

$\times P(spreading|An\ adorable\ little\ boy\ is)\times P(smiles|An\ adorable\ little\ boy\ is\ spreading)$
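The chain-rule decomposition above can be sketched in Python as follows. This is only an illustration: the names `sentence_probability`, `cond_prob`, and `constant_model` are hypothetical, and the constant value 0.5 is a placeholder rather than an estimate from real data.

```python
from typing import Callable, Sequence


def sentence_probability(
    words: Sequence[str],
    cond_prob: Callable[[str, Sequence[str]], float],
) -> float:
    """Multiply P(w_i | w_1, ..., w_{i-1}) over all positions i (chain rule)."""
    probability = 1.0
    for i, word in enumerate(words):
        history = words[:i]  # all words before position i
        probability *= cond_prob(word, history)
    return probability


# Placeholder model (an assumption for illustration): every word gets
# probability 0.5 regardless of its history.
def constant_model(word: str, history: Sequence[str]) -> float:
    return 0.5


sentence = "An adorable little boy is spreading smiles".split()
print(sentence_probability(sentence, constant_model))  # product of 7 factors
```

In practice `cond_prob` would be replaced by whatever estimator the model uses, such as the count-based estimate discussed in the next section.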

N-gram Language Model

A statistical language model computes probabilities based on counts of how often sequences occur in a corpus.

$P(is|An\ adorable\ little\ boy) = \frac{count(An\ adorable\ little\ boy\ is)}{count(An\ adorable\ little\ boy)}$

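This approach cannot handle sequences that do not appear in the corpus. If the sequence 'An adorable little boy is' never appears, the estimated probability becomes 0, and if 'An adorable little boy' never appears, the denominator becomes 0 and the probability is undefined.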
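A minimal sketch of the count-based estimate and its two failure cases. The one-sentence corpus and the helper names `count_sequence` and `count_based_prob` are assumptions made for illustration; returning `None` for the undefined case is a design choice of this sketch.

```python
from typing import List, Optional


def count_sequence(tokens: List[str], sequence: List[str]) -> int:
    """Count how many times `sequence` occurs contiguously in `tokens`."""
    n = len(sequence)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == sequence)


def count_based_prob(tokens: List[str], history: List[str], word: str) -> Optional[float]:
    """Return count(history + word) / count(history), or None if undefined."""
    denominator = count_sequence(tokens, history)
    if denominator == 0:
        return None  # the history never occurs, so the ratio is undefined
    return count_sequence(tokens, history + [word]) / denominator


# Toy corpus (hypothetical, for illustration only).
corpus = "an adorable little boy is spreading smiles".split()

seen = count_based_prob(corpus, "an adorable little boy".split(), "is")
unseen_word = count_based_prob(corpus, "an adorable little boy".split(), "was")
unseen_history = count_based_prob(corpus, "a cute little boy".split(), "is")
print(seen, unseen_word, unseen_history)
```

The second call shows the zero-probability case (the history exists, the continuation does not), and the third shows the undefined case (the history itself is missing).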

Such cases become more common as sentences get longer. For example, a corpus is more likely to contain 'I am a student' than exactly 'I am a super cool adorable student'. This suggests that conditioning on fewer words raises the chance that the sequence exists in the corpus: 'little boy is' is more likely to appear than 'An adorable little boy is'.

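The N-gram language model addresses this 'probability cannot be computed' limitation of the statistical language model by considering only some of the preceding words rather than all of them. Here, n is the number of consecutive words in each sequence that is considered. For the example sentence, the n-grams are: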

unigrams : an, adorable, little, boy, is, spreading, smiles

bigrams : an adorable, adorable little, little boy, boy is, is spreading, spreading smiles

trigrams : an adorable little, adorable little boy, little boy is, boy is spreading, is spreading smiles

4-grams : an adorable little boy, adorable little boy is, little boy is spreading, boy is spreading smiles
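The lists above can be reproduced with a short sliding-window function. This is a sketch; the name `extract_ngrams` is hypothetical, and tokenization is simply whitespace splitting on the lowercased example sentence.

```python
from typing import List, Tuple


def extract_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous run of n tokens, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "an adorable little boy is spreading smiles".split()
for n in range(1, 5):
    # Join each n-gram back into a string for display.
    print(n, [" ".join(gram) for gram in extract_ngrams(tokens, n)])
```

A sentence of 7 words yields 7 unigrams, 6 bigrams, 5 trigrams, and 4 4-grams, matching the lists above.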

In an n-gram language model, the prediction of a word depends only on the n-1 words that precede it.

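For example, to predict the word that follows 'An adorable little boy is spreading' with a 4-gram LM, only the 4-1 = 3 words immediately before the word to be predicted are considered, and the probability is computed as follows.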

$P(w|boy\ is\ spreading) = \frac{count(boy\ is\ spreading\ w)}{count(boy\ is\ spreading)}$

Suppose that in the corpus 'boy is spreading' appears 1,000 times, 'boy is spreading insults' appears 500 times, and 'boy is spreading smiles' appears 200 times. The probabilities are then:

$P(insults|boy\ is\ spreading) = 0.500$

$P(smiles|boy\ is\ spreading) = 0.200$

If the word with the highest probability is chosen, the word that follows 'boy is spreading' is 'insults'.
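The same arithmetic as a small Python sketch. The counts are the ones given in the example above; the function name `next_word_probs` is hypothetical.

```python
from typing import Dict

# Counts taken from the example in the text.
context_count = 1000  # count("boy is spreading")
continuation_counts: Dict[str, int] = {
    "insults": 500,  # count("boy is spreading insults")
    "smiles": 200,   # count("boy is spreading smiles")
}


def next_word_probs(counts: Dict[str, int], total: int) -> Dict[str, float]:
    """P(w | context) = count(context w) / count(context) for each listed w."""
    return {word: count / total for word, count in counts.items()}


probs = next_word_probs(continuation_counts, context_count)
best_word = max(probs, key=probs.get)  # word with the highest probability
print(probs)      # {'insults': 0.5, 'smiles': 0.2}
print(best_word)  # insults
```

Note that the two listed probabilities sum to 0.7, so the remaining 0.3 belongs to other continuations that the example does not list.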

Limitations of the N-gram Language Model

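It does not fundamentally solve the limitation of the statistical model.
So what value of n should be used?
- A larger n can be expected to improve model performance → bigrams generally perform better than unigrams, and trigrams better than bigrams.
  However, as n keeps growing, the amount of computation keeps growing too, and the model runs back into the limitation of the original model: longer sequences are less likely to appear in the corpus.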
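To make the trade-off concrete: with a vocabulary of V words there are V^n distinct possible n-grams, so the number of sequences that would need counts grows exponentially with n. The sketch below uses an assumed vocabulary size of 10,000, chosen only for illustration; `possible_ngrams` is a hypothetical helper name.

```python
def possible_ngrams(vocab_size: int, n: int) -> int:
    """Number of distinct n-grams that can be formed from vocab_size words."""
    return vocab_size ** n


VOCAB_SIZE = 10_000  # assumed vocabulary size, for illustration only

for n in range(1, 5):
    print(f"n={n}: {possible_ngrams(VOCAB_SIZE, n):,} possible n-grams")
```

A fixed-size corpus can cover only a shrinking fraction of these as n grows, which is why zero counts become more frequent for larger n.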

This post was written with reference to 딥러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning).
