
🥫 [ELMo Paper Review] Deep Contextualized Word Representations

이유 YIYU 2024. 5. 6. 12:42
โœจ ์ฒ˜์Œ ์‹œ์ž‘ํ•˜๋Š” ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋…ผ๋ฌธ ์ฝ๊ธฐ์˜ ์ฒซ ๊ธ€์ž…๋‹ˆ๋‹ค. ์ด๋ฒˆ ๊ธ€์—์„œ ๋‹ค๋ฃฐ ๋…ผ๋ฌธ์€ 2018๋…„์— ๊ฒŒ์žฌ๋œ Deep Contextualized Word Representations์ž…๋‹ˆ๋‹ค. ELMo๋ผ๋Š” ์ด๋ฆ„์„ ๊ฐ€์ง„ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

1. Background

์‚ฌ์ „ํ•™์Šต๋œ word representations์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ์ž‘์—…์— ์‚ฌ์šฉ๋˜๊ธฐ ์ „ ๋Œ€๊ทœ๋ชจ ์ฝ”ํผํŠธ์—์„œ ํ•™์Šตํ•œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์˜ ์ผ์ข…์ด๋‹ค. ์ด๋Ÿฌํ•œ ํ‘œํ˜„์€ ๋งŽ์€ neural language ๋ชจ๋ธ์—์„œ ์ค‘์š”ํ•œ ๊ตฌ์„ฑ ์š”์†Œ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ณ ํ’ˆ์งˆ์˜ representation์„ ํ•™์Šตํ•˜๋Š”๋ฐ 2๊ฐ€์ง€ ์–ด๋ ค์›€์ด ์žˆ๋‹ค.

  1. Modeling the complex characteristics of word use, such as syntax and semantics
  2. Modeling how a word's meaning varies with context, as with polysemous words

Let's look at each in turn.

For syntax and semantics, tense is a good example. In "I read a book yesterday.", read means "did read". In "I will read a book today.", read means "will read". How do we arrive at different interpretations of the same word read? Most likely by inferring from the neighboring words yesterday and today.

๋‹ค์˜์–ด๋กœ๋Š” ๊ฐ™์€ ๋‹จ์–ด์ด์ง€๋งŒ ์ „ํ˜€ ๋‹ค๋ฅธ ์˜๋ฏธ๋กœ ์“ฐ์ด๋Š” ๊ฒฝ์šฐ๋ฅผ ๋งํ•œ๋‹ค. capital gain์˜ capital์€ ์ž๋ณธ์ด๋ž€ ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š”๋‹ค. capital city์˜ capital์€ ์ˆ˜๋„์˜ ์˜๋ฏธ๋ฅผ ๊ฐ–๋Š”๋‹ค.

There are many cases like these where the same word is used differently, but static word embeddings cannot resolve them: every occurrence of a word is mapped to the same vector.
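To see why, here is a minimal Python sketch of a static embedding lookup (the vectors are made-up toy values): the word capital receives the identical vector in both phrases, so the two senses cannot be distinguished.

```python
# Toy static-embedding table; the vectors are made-up illustrative values.
embeddings = {
    "capital": [0.12, -0.53, 0.77],   # one vector, regardless of sense
    "gain":    [0.40,  0.10, -0.21],
    "city":    [-0.35, 0.22,  0.05],
}

def embed(sentence):
    """Look up each token's static vector; the context is ignored entirely."""
    return [embeddings[token] for token in sentence.split()]

# "capital" gets the identical vector in both phrases, even though the
# intended senses (financial assets vs. seat of government) differ.
assert embed("capital gain")[0] == embed("capital city")[0]
```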

2. Related work

Previous work proposed enriching word vectors with subword information or learning a separate vector for each word sense. Other related research includes context2vec, which focuses on context-dependent representations, as well as supervised neural machine translation systems and unsupervised language models that include the pivot word itself in the representation. ELMo takes full advantage of these approaches and trains its biLM on a corpus of approximately 30 million sentences.

3. ELMo: Embeddings from Language Models

Unlike conventional word embeddings that represent a word as a fixed vector, ELMo takes the entire sentence as input and captures the complex characteristics of word use, such as syntax and semantics. ELMo representations are computed on top of a two-layer biLM with character-level convolutions, as a linear function of the biLM's internal states. This makes it easy to plug ELMo into other NLP models.

3.1 Bidirectional language models (biLM)


N๊ฐœ์˜ token(\(t_1, t_2, ..., t_N\))์ด ์žˆ๋‹ค๊ณ  ํ• ๋•Œ, Forward Language Model์€ \((t_1,t_2, ..., t_{k_1})\)์ด ์ฃผ์–ด์กŒ์„ ๋•Œ \(t_k\)๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๊ณ„์‚ฐํ•œ๋‹ค.

๋ฌธ์žฅ์ด ์ฃผ์–ด์กŒ์„ ๋•Œ, ๋‹จ์–ด๋Š” character ์ž„๋ฒ ๋”ฉ์œผ๋กœ representation๋œ ๋’ค, ์ฒซ LSTM ์…€๋กœ ์ž…๋ ฅ๋œ๋‹ค. character ์ž„๋ฒ ๋”ฉ์œผ๋กœ ์ „ํ™˜๋˜๋Š” ์ด์œ ๋Š” 2๊ฐ€์ง€์ด๋‹ค.

  • ์ตœ์ดˆ ์ž„๋ฒ ๋”ฉ์€ ๋ฌธ๋งฅ์˜ ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š์•„์•ผ ํ•œ๋‹ค.
  • ์„ ํ–‰ ํ•™์Šต๋œ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ๊ณผ ๋น„๊ตํ•˜๊ธฐ ์œ„ํ•ด Glove๋‚˜ Word2Vec๊ณผ ๊ฐ™์€ ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์•˜๋‹ค.

์ž…๋ ฅ๋‹จ์—์„œ ๋ฌธ๋งฅ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š์•˜์ง€๋งŒ layer๋ฅผ ์ง€๋‚˜์น ์ˆ˜๋ก ๋ฌธ๋งฅ์— ์˜ํ–ฅ์„ ๋ฐ›๋„๋ก ์„ค๊ณ„๋˜์—ˆ๋‹ค. ์ฒซ LSTM ์ถœ๋ ฅ์€ char ์ž„๋ฒ ๋”ฉ๊ณผ residual connection์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. residual connection์€ 2๊ฐ€์ง€ ์žฅ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

  • ์ƒ์œ„ layer๋“ค์ด ํ•˜์œ„ layer์˜ ํŠน์ง•์„ ์žƒ์ง€ ์•Š์•„์•ผ ํ•œ๋‹ค.
  • gradient descent๋ฅผ ํ†ตํ•œ gradient vanishing ํ˜„์ƒ์„ ๊ทน๋ณตํ•˜๋„๋ก ๋„์™€์ค€๋‹ค.

L๊ฐœ์˜ layer์— ์ „๋‹ฌ ํ›„ softmax layer๋กœ ๋‹ค์Œ token์„ ์˜ˆ์ธกํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.

The backward language model has the same form as the forward one, but runs over the sequence in reverse, predicting the preceding token given the future tokens.
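In equation form, again following the paper:

\[
p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, \ldots, t_N)
\]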

biLM์€ ๋‘ ๋ฐฉํ–ฅ์˜ log likelyhood๋ฅผ ์ตœ๋Œ€ํ™” ์‹œํ‚ค๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

Here \(\Theta_x\) denotes the token representation and \(\Theta_s\) the softmax layer; sharing these parameters between the forward and backward directions reduces model complexity and improves training.

3.2 ELMo


ELMo is a combination of the layer representations of the two LSTMs (forward and backward). For each token \(t_k\), the L-layer biLM plus the token layer yields a set of 2L+1 representations.
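In the paper's notation:

\[
R_k = \{\, x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, \ldots, L \,\} = \{\, h_{k,j}^{LM} \mid j = 0, \ldots, L \,\}
\]

where \(h_{k,0}^{LM}\) is the token layer and \(h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}]\) concatenates the two directions at each biLSTM layer.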

First, the forward and backward LM outputs are concatenated at each layer; the layers are then weighted and summed, and a scalar parameter \(\gamma^{task}\) scales the size of the resulting vector, which plays an important role in optimization.
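The resulting task-specific combination, as defined in the paper, is

\[
ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
\]

where \(s^{task}\) are softmax-normalized weights over the layers.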

3.3 Using biLMs for supervised NLP tasks


biLM์„ ์‚ฌ์šฉํ•˜์—ฌ task model์„ ๊ฐœ์„ ํ•˜๊ธฐ ์œ„ํ•ด์„  ์•„๋ž˜์™€ ๊ฐ™์€ ๊ณผ์ •์„ ์ง„ํ–‰ํ•œ๋‹ค.

  1. Run the biLM and record the layer representations of every word.
  2. Given a token sequence \((t_1, t_2, ..., t_N)\), the task model forms a context-independent token representation \(x_k\) from a word embedding or, optionally, a character-based representation. A context-sensitive representation \(h_k\) is then formed from it.

To add ELMo to the supervised model (a minimal sketch follows these steps):

  1. biLM์˜ ๊ฐ€์ค‘์น˜๋ฅผ ๊ณ ์ •์‹œํ‚จ๋‹ค.
  2. ELMo vector \(ELMo_k^{task}\)์™€ \(x_k\)๋ฅผ concatenateํ•œ๋‹ค.
  3. task RNN์— ์ „๋‹ฌํ•œ๋‹ค.
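Here is a minimal PyTorch sketch of steps 2 and 3; the dimensions and the randomly generated \(x_k\) and ELMo tensors are illustrative placeholders (in practice the ELMo vectors would come from the frozen biLM).

```python
import torch
import torch.nn as nn

# Illustrative sizes; in practice elmo_k comes from the frozen biLM.
batch, seq_len, x_dim, elmo_dim = 2, 7, 100, 256

x_k = torch.randn(batch, seq_len, x_dim)        # task word embeddings x_k
elmo_k = torch.randn(batch, seq_len, elmo_dim)  # ELMo_k^task (biLM weights frozen)

enhanced = torch.cat([x_k, elmo_k], dim=-1)     # step 2: [x_k; ELMo_k^task]
task_rnn = nn.LSTM(x_dim + elmo_dim, 128, batch_first=True)
h, _ = task_rnn(enhanced)                       # step 3: context-sensitive h_k
print(h.shape)                                  # torch.Size([2, 7, 128])
```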

Finally, the authors found that adding a moderate amount of dropout to ELMo was effective, and that in some cases it also helped to regularize the ELMo weights by adding \(\lambda||w||^2_2\) to the loss; this biases the ELMo weights toward an average of all biLM layers and yields useful results.

3.4 Pre-trained bidirectional language model architecture


์ด ๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉ๋œ pre-trained biLM์€ ozefowicz et al. (2016)๊ณผ Kim et al. (2015)์˜ ์•„ํ‚คํ…์ฒ˜์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ ์–‘๋ฐฉํ–ฅ ํ•™์Šต๊ณผ LSTM layer ๊ฐ„์˜ residual connection์„ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค.

To balance overall language model perplexity against model size and the computational requirements of downstream tasks, while keeping a purely character-based input representation, all embedding and hidden dimensions of the CNN-BIG-LSTM were halved.

๊ทธ ๊ฒฐ๊ณผ 4096๊ฐœ์˜ unit๊ณผ 512 ์ฐจ์›์˜ projection layer, 1๋ฒˆ์งธ layer์—์„œ 2๋ฒˆ์งธ layer๋กœ์˜ residual connection์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค. ๋ฌธ๋งฅ์— ๋ฌด๊ด€ํ•œ representation์€ 2048๊ฐœ์˜ ๋ฌธ์ž n-gram convolution filter๋ฅผ ์‚ฌ์šฉํ•˜๊ณ  ์ด๋ฅผ 2๊ฐœ์˜ highway layer์™€ 512 representation์˜ linear projection์ด ์ด์–ด์ง„๋‹ค.
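Summarized as a config sketch (the values are those reported in the paper; the key names are my own, not from any official config file):

```python
# Hyperparameters of the pre-trained biLM described above.
# Key names are illustrative; the values follow the paper.
ELMO_BILM_CONFIG = {
    "num_bilstm_layers": 2,        # L = 2
    "lstm_units": 4096,
    "projection_dim": 512,
    "residual_connection": "layer 1 -> layer 2",
    "char_ngram_filters": 2048,    # context-insensitive type representation
    "num_highway_layers": 2,
    "token_embedding_dim": 512,    # linear projection after the highway layers
}
```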

๊ทธ ๊ฒฐ๊ณผ, biLM์€ ๋ฌธ์ž ์ž…๋ ฅ์œผ๋กœ ์ธํ•ด ํ•™์Šต์…‹์„ ๋ฒ—์–ด๋‚˜๋Š” ๊ฒƒ์„ ํฌํ•จํ•œ 3๊ฐœ์˜ representaion layer๋ฅผ ์ œ๊ณตํ•œ๋‹ค.


์ด๋ฒˆ ๊ธ€์—์„œ๋Š” 2018๋…„์— ๊ฒŒ์žฌ๋œ ELMo ๋…ผ๋ฌธ์„ ์‚ดํŽด๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋ฌธ๋งฅ์„ ์ดํ•ดํ•˜๋Š” sentence representation์— ๋Œ€ํ•ด ์ƒˆ๋กญ๊ฒŒ ์•Œ๊ฒŒ ๋˜์–ด ํฅ๋ฏธ๋กœ์› ์Šต๋‹ˆ๋‹ค. ๋‹ค๋งŒ, ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ๋‹จ์–ด๋“ค์ด ๋ฌธ๋งฅ์„ ํฌํ•จํ•˜์—ฌ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๊ถ๊ธˆํ•ด์ง€๋„ค์š”. ๊ธด ๊ธ€ ์ฝ์–ด์ฃผ์…”์„œ ๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค!๐Ÿฅ