Adam-mini: Use Fewer Learning Rates To Gain More

Paper: https://arxiv.org/abs/2406.16793
Code: https://github.com/zyushun/Adam-mini

Abstract

Adam-mini: a proposed optimizer that matches or outperforms AdamW while using 45% to 50% less memory
It reduces memory use by cutting down the learning-rate resources in Adam
- Partition the parameters into blocks, following the Hessian structure
- Assign a single learning rate to each parameter block

1. Introduction

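Adam(W) is the most widely used optimizer, but it is expensive.
- First-order momentum m and second-order momentum v together take at least twice the memory of the model itself
- A separate learning rate is applied to every trainable parameter
- For a 7B model, Adam's m and v alone require about 56 GB of memory
- For PaLM (540B parameters), more than 50 GPUs can be needed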
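As a rough check on the 56 GB figure, assuming m and v are each stored in float32 (4 bytes per parameter) and 1 GB = 10^9 bytes, neither of which is stated in these notes:

$$
\underbrace{7 \times 10^{9}}_{\text{parameters}} \times \underbrace{4\ \text{bytes}}_{\text{float32}} \times \underbrace{2}_{m \text{ and } v} = 56 \times 10^{9}\ \text{bytes} = 56\ \text{GB}
$$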
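Why an efficient optimizer design is needed
- Lower memory use raises throughput and speeds up training
- Fewer GPUs are needed, saving cost and energy
- It lowers the threshold for LLM training, encouraging more researchers with limited GPU resources to participate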

Contribution

New Optimizer: Adam-mini
- Partitions the model parameters according to the Hessian structure
- In each block, picks a single learning rate using the average of Adam's v

Lightweightness
- Greatly reduces the number of learning rates used in Adam
- On mainstream LLMs the reduction is over 90%, cutting memory cost by 45% to 50%
Effectiveness
- Matches or outperforms AdamW on language models ranging from 125M to 7B, across pre-training, supervised fine-tuning (SFT), and reinforcement learning from human feedback (RLHF)
- Also achieves better performance on non-LLM tasks such as diffusion models, vision models, and graph neural networks
Efficiency
- Higher throughput than AdamW
- When pre-training Llama2-7B on 2 x A800-80GB, 49.6% higher throughput than AdamW, saving 33.1% of wall-clock time

Partition principle
- The parameter partitioning strategy proposed in Adam-mini
- Partition the parameters by the smallest dense sub-blocks of the Hessian
- A single (but good) learning rate per partitioned block can give better performance
Hessian structure of Transformers
- The default partition is unstable when training Transformers, so a Transformer-specific partition is applied, based on the partition principle
- Dense blocks are split by head (Query and Key), as a whole (Value and Projection), and by layer (MLP)

2. Method

2.1. Method and Observations

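In Adam, v provides a learning rate for each individual parameter
- The Hessian of Transformers and of various other neural networks is close to block-diagonal
- In Transformers, each block has a different eigenvalue distribution
→ Transformers need a different learning rate for each block to handle this eigenvalue heterogeneity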

https://arxiv.org/abs/2402.16788 argues that each block needs a different learning rate
Adam goes further and assigns a different learning rate not just to each block but to each parameter

Comparison on a block-diagonal Hessian

Adam vs. single-learning-rate gradient descent

Figure 4. (a), (b): Adam outperforms the single-learning-rate method
Figure 4. (c), (d): On a dense sub-block, the single-learning-rate method outperforms Adam; this holds for every sub-block of (a)
Figure 4. (b), green line: gradient descent with a single learning rate per block

→ Applying a single learning rate to each dense sub-block can still give better performance.

Adam can be viewed as a diagonal preconditioned method, but it may not be a good preconditioner and thus cannot effectively reduce the condition number of the dense sub-matrix
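
The point about diagonal preconditioning can be illustrated on a tiny dense block. The sketch below is a toy example, not taken from the paper: it uses a hand-picked 2x2 positive-definite matrix and D = diag(H) as a stand-in for a diagonal preconditioner (Adam's actual preconditioner is built from gradient statistics, so this is only an analogy). The function names are made up for illustration.

```python
import math

def sym2x2_condition_number(a, b, c):
    """Condition number of the symmetric positive-definite matrix [[a, b], [b, c]]."""
    mean = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
    lam_max, lam_min = mean + radius, mean - radius
    return lam_max / lam_min

def diagonally_preconditioned(a, b, c):
    """Entries of D^{-1/2} H D^{-1/2} with D = diag(H), returned as (a', b', c')."""
    r = b / math.sqrt(a * c)
    return 1.0, r, 1.0

# Toy dense block with strong off-diagonal coupling (hypothetical values).
H = (1.0, 1.9, 4.0)

cond_before = sym2x2_condition_number(*H)
cond_after = sym2x2_condition_number(*diagonally_preconditioned(*H))

print(f"condition number of H:             {cond_before:.1f}")
print(f"after diagonal preconditioning:    {cond_after:.1f}")
```

In this toy case the diagonal rescaling only equalizes the diagonal entries; the off-diagonal coupling remains, so the block stays ill-conditioned.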

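4-layer Transformer

Following the default partition by PyTorch, one parameter block is chosen arbitrarily as the “left-out” block
The left-out block is switched to a single learning rate, while the remaining blocks keep using Adam
The single learning rate is found by grid search, with a cosine decay schedule applied
For every choice of left-out block, the training loss is similar to or better than Adam's
As the number of left-out blocks (blocks that need a single learning rate) grows, the cost of grid search increases exponentially
Can a good single learning rate be found simply, without grid search? That is Adam-mini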
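
One way to read the exponential growth, under the assumption (not stated in these notes) that the learning rates of $n$ left-out blocks are tuned jointly over a grid of $k$ candidates each:

$$
\text{number of grid-search runs} = k^{n}, \qquad \text{e.g. } k = 5,\ n = 4 \ \Rightarrow\ 5^{4} = 625 .
$$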

2.2. Proposed Method: Adam-mini

Algorithm 1. “Adam-mini in Pytorch style”

Goal: find the learning rates without grid search and cut down Adam's learning-rate resources

The method has two steps; Step 1 is needed only at initialization

Step 1-1. Partition the model parameters

Transformers use Algorithm 2
- Algorithm 2. “Partition for Transformers”
All Queries and Keys are partitioned by head; everything else uses the default PyTorch partition
Other networks use the default PyTorch partition
- Algorithm 3. “Partition for non-Transformers”
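
The partition rule above can be sketched in plain Python. This is a minimal illustration, not the paper's Algorithm 2: it works on a dictionary of parameter shapes instead of a real model, and the name markers `q_proj` / `k_proj` are a hypothetical naming convention (real models name their Query and Key weights differently). Embedding and output layers are not treated specially here.

```python
from math import prod

def partition_for_transformer(param_shapes, n_heads, qk_markers=("q_proj", "k_proj")):
    """Return a list of (block_name, num_elements) pairs.

    param_shapes: dict mapping parameter name -> shape tuple.
    qk_markers:   hypothetical substrings that identify Query/Key weights.
    """
    blocks = []
    for name, shape in param_shapes.items():
        numel = prod(shape)
        if any(marker in name for marker in qk_markers):
            # Query and Key: one block per attention head.
            if numel % n_heads != 0:
                raise ValueError(f"{name} cannot be split evenly into {n_heads} heads")
            for head in range(n_heads):
                blocks.append((f"{name}.head{head}", numel // n_heads))
        else:
            # Everything else: the default partition, one block per tensor.
            blocks.append((name, numel))
    return blocks

# Toy shapes for a single layer (illustrative only).
shapes = {
    "layer0.attn.q_proj.weight": (8, 8),
    "layer0.attn.k_proj.weight": (8, 8),
    "layer0.attn.v_proj.weight": (8, 8),
    "layer0.mlp.up.weight": (32, 8),
}
blocks = partition_for_transformer(shapes, n_heads=2)
for block_name, size in blocks:
    print(block_name, size)
```

Each resulting block would later receive one learning rate, so the number of blocks (not the number of parameters) sets how many values of v are stored.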

Step 1-2. Choose embd_blocks

Algorithm 4. “Get_embd_blocks”
For Transformers, embd_blocks contains the embedding layer and the output layer
For other networks, no parameters are selected

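Step 2. Use a single learning rate for each parameter block

A single learning rate is used for each parameter block outside embd_blocks
To choose a suitable learning rate, $g \circ g$ in vanilla Adam is replaced by its mean over the block.
As in Adam, a moving average is applied to this mean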
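
Step 2 can be sketched for a single block as follows. This is a simplified illustration in the spirit of Algorithm 1, not the authors' implementation: it uses plain Python lists instead of tensors, omits weight decay, and the function name, state layout, and default hyperparameters are assumptions. The key difference from Adam is that `state["v"]` is one scalar for the whole block, where Adam would keep one value per parameter.

```python
import math

def adam_mini_block_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update for a single parameter block; returns the new parameter list."""
    state["step"] += 1
    t = state["step"]

    # Replace g * g (elementwise) by its mean over the block, then apply
    # the usual exponential moving average. v is a single scalar.
    mean_sq = sum(g * g for g in grads) / len(grads)
    state["v"] = beta2 * state["v"] + (1 - beta2) * mean_sq
    v_hat = state["v"] / (1 - beta2 ** t)      # bias correction, as in Adam
    denom = math.sqrt(v_hat) + eps

    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # First-order momentum is still kept per parameter.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        m_hat = state["m"][i] / (1 - beta1 ** t)
        new_params.append(p - lr * m_hat / denom)
    return new_params

params = [1.0, -2.0]
grads = [3.0, 4.0]
state = {"step": 0, "m": [0.0, 0.0], "v": 0.0}
updated = adam_mini_block_step(params, grads, state)
print(updated, state["v"])
```

Because both parameters share the same denominator, their updates stay proportional to their momentum, unlike Adam, where each coordinate is rescaled separately.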

Remark on the “embd_blocks”

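“embd_blocks” refers to the embedding layer and the output layer in Transformers
Why they are excluded from Step 2
- Removing “embd_blocks” (so that these layers also get a single averaged learning rate) makes training unstable: Figure 7. (b), Adam-mini (embd_blocks_removed), red line
- Within a minibatch, many rows become zero because the corresponding tokens do not appear
- Taking the mean over such a block then yields a biased learning rate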
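
A toy calculation makes the bias concrete. The sketch below assumes the zero rows are rows of the embedding gradient, and uses made-up sizes and a uniform gradient value purely for illustration: when only a few token rows are active, averaging $g \circ g$ over the whole layer is dominated by zeros.

```python
import math

vocab_size, dim = 1000, 4      # hypothetical embedding shape
active_rows = 10               # tokens that actually appear in the minibatch
grad_value = 0.5               # same gradient magnitude on every active entry

grads = [
    [grad_value] * dim if row < active_rows else [0.0] * dim
    for row in range(vocab_size)
]

# Mean of g * g over the whole layer vs. over the active rows only.
mean_all = sum(g * g for row in grads for g in row) / (vocab_size * dim)
mean_active = sum(g * g for row in grads[:active_rows] for g in row) / (active_rows * dim)

dilution = mean_active / mean_all       # how much the zeros shrink the mean
step_inflation = math.sqrt(dilution)    # effect on 1 / sqrt(v)

print(f"mean over all rows:    {mean_all}")
print(f"mean over active rows: {mean_active}")
print(f"step size for active rows is inflated by about {step_inflation:.1f}x")
```

With 1% of rows active, the layer-wide mean is about 100 times smaller than the mean over active rows, so a shared $1/\sqrt{v}$ would be about 10 times larger than those rows warrant.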

2.3. Principle for the Partition Strategy

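For non-Transformer tasks the default partition works well; for Transformers it is unstable
A different partition algorithm is applied to Transformers
- Inspecting Query and Key shows that they can be split by head into smaller Hessian blocks
- Value, attention projection, and MLP cannot be split into smaller blocks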

2.4. Some Characteristics of Adam-mini

Memory cut down.

The fraction of memory saved depends on the fraction of non-embedding parameters in the model
For Llama2-7B this fraction is 96.2%; for Llama3-70B it is 99.25%
Overall memory savings reach about 45% to 50%
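
The link between these fractions and the 45% to 50% figure can be checked with a short calculation. The sketch assumes that m and v each take one slot per parameter (so v is half of Adam's state), that the quoted fraction is the share of v that is dropped, and it ignores the few scalars kept per block. The function name is made up for illustration.

```python
def optimizer_state_saving(fraction_of_v_removed):
    """Fraction of Adam's (m + v) state saved when part of v is dropped.

    Assumes m and v are the same size, so v accounts for half of the state.
    """
    return 0.5 * fraction_of_v_removed

cases = [
    ("90% of v removed", 0.90),
    ("Llama2-7B", 0.962),
    ("Llama3-70B", 0.9925),
]
for name, fraction in cases:
    print(f"{name}: {optimizer_state_saving(fraction):.1%} of Adam's state saved")
```

Under these assumptions the saving can never exceed 50%, since m is kept in full.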

Higher throughput.

Greatly reduces the number of tensor operations
The saved memory allows larger batch sizes
When pre-training Llama2-7B on 2 x A800-80GB, it achieves 49.6% higher throughput than AdamW and saves 33.1% of wall-clock time
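
The two numbers are consistent with each other: for a fixed number of training tokens, time is inversely proportional to throughput, so a throughput ratio of 1.496 gives

$$
\text{time saved} = 1 - \frac{1}{1.496} \approx 0.33 ,
$$

which agrees with the reported 33.1% up to rounding of the throughput figure.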

Has room to improve.

The learning rate derived from the average of Adam's v may not be optimal

Some orthogonal combinations.

It could be combined with other memory-efficient optimizers (GaLore, Sophia) for further memory savings and higher throughput

3. Experiments

LLM training uses 4 NVIDIA A800-80GB GPUs
All other experiments use 4 V100 GPUs
Compared optimizers: AdamW, Adam-mini, Adafactor, CAME, SM3

3.1. LLM Pre-training

3.2. Supervised Fine-tuning and RLHF

3.3. Non-LLM Tasks

5. Concluding Remarks

The paper proposes Adam-mini, a new optimizer that saves 45% to 50% of memory compared with Adam, but there is still plenty of room for improvement!


RedMooN_MJ
