The incredibly famous Chinchilla paper changed the way we train LLMs. Its authors - among them Arthur Mensch, now CEO of Mistral - laid out scaling laws for maximising model performance under a fixed compute budget by balancing the number of parameters against the number of training tokens.
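As a rough illustration (not part of the talk itself), here is a minimal Python sketch of the Chinchilla-style rule of thumb, assuming the common approximations of training compute C ≈ 6·N·D and roughly 20 training tokens per parameter; the function name and constants are illustrative, not from the paper verbatim:

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Approximate compute-optimal (parameters, tokens) split for a FLOP budget.

    Assumes C ~= 6 * N * D and a compute-optimal ratio of ~20 tokens per parameter.
    """
    # With C = 6 * N * D and D = tokens_per_param * N, solve C = 6 * tokens_per_param * N^2 for N.
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Roughly the Chinchilla budget (~5.76e23 FLOPs) recovers ~70B parameters and ~1.4T tokens.
    params, tokens = chinchilla_optimal(5.76e23)
    print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
```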
Today, these heuristics are in jeopardy. LLaMA-3, for one, was trained on roughly 15 trillion tokens of text, far more than the Chinchilla recipe would prescribe for its size - and that is a large part of why it is so good. How much data do we actually need to train LLMs? This talk will shed light on the latest trends in model training and perhaps suggest newer scaling laws.