Over the past couple of weeks I’ve been doing testing on an 8 x AMD MI300X node provided by Hot Aisle. I’ll have an article on some of my experiments training with MI300s coming soon, but first, a deep dive into inferencing with vLLM.
There’s been plenty of inferencing testing done on the MI300X over the past few months, so my original goal was just to do some quick revalidation to gauge how rapidly the software stack has been maturing: ROCm 6.2 is significantly improved, and vLLM v0.6 has had recent performance optimizations as well. I’m able to confirm that there have indeed been some big performance gains, but perhaps more interestingly, during my testing I also ended up exploring and characterizing a few of the ways that you can tune vLLM for improved performance.
If you’re just perusing or short on time, you can jump straight to the conclusions for a brief summary. Also, just for fun (it’s not entirely accurate; just listening to the first few seconds, I noticed they fumble some of the benchmark numbers), here’s an 11-minute NotebookLM Deep Dive Podcast summary as well: