Apple M3 Machine Learning Speed Test

I put my M1 Pro against Apple's new M3, M3 Pro, M3 Max, an NVIDIA GPU and Google Colab.


I've been using my M1 Pro MacBook Pro 14-inch for the past two years.

I bought the upgraded version with extra RAM, GPU cores and storage to future-proof it.

And it hasn't missed a beat.

But Apple's latest release of the M3 series got me curious.

I watched the presentation and saw a bunch of graphs about it being their biggest GPU performance leap in years.

Being a machine learning engineer, I naturally wanted to see how they would perform from a machine learning standpoint.

My M1 Pro is unmatched in day-to-day usage.

I love it.

But I wouldn't go training larger scale machine learning models on it.

Can the M3 series change this?

I did a bunch of tests to find out.

Resources

  • Code on GitHub – all of the code I used to set up and run the tests across the machines can be found on GitHub.
  • Video walkthrough – I also made a video walkthrough of all the results plus a few tips and recommendations on YouTube.

Machines we're testing

The following are the machines I tested.

All of the M3 variants of the MacBook Pro were the base model in their class (e.g. an M3 Pro MacBook Pro with no upgrades from the Apple Store).

| Machine | CPU | GPU | RAM | Storage | Price (USD) |
| --- | --- | --- | --- | --- | --- |
| M1 Pro 14" 2021 | 10-core CPU | 16-core GPU | 32GB | 4TB SSD | ~$3,500 |
| M3 14" 2023 | 8-core CPU | 10-core GPU | 8GB | 512GB SSD | $1,599 |
| M3 Pro 14" 2023 | 11-core CPU | 14-core GPU | 18GB | 512GB SSD | $1,999 |
| M3 Max 14" 2023 | 14-core CPU | 30-core GPU | 36GB | 1TB SSD | $3,199 |
| Deep Learning PC | Intel i9 | NVIDIA TITAN RTX (24GB) | 32GB | 1TB SSD | ~$3,000 |
| Google Colab Free Tier | 2-core CPU | NVIDIA Tesla V100 (16GB) | 12GB | 100GB SSD | Free or $10/month for more compute |

Tests we're performing

You can find all of the code for the following tests on GitHub.

| Experiment | Model | Dataset | Num Samples | Problem Type | Backend |
| --- | --- | --- | --- | --- | --- |
| 1 | ResNet50 (CNN) | CIFAR100 | 50,000 train, 10,000 test | Image Classification | PyTorch |
| 2 | ResNet50 | Food101 | 75,750 train, 25,250 test | Image Classification | PyTorch |
| 3 | DistilBERT (Transformer) | IMDB | 25,000 train, 25,000 test | Text Classification | PyTorch |
| 4 | ResNet50 | CIFAR100 | 50,000 train, 10,000 test | Image Classification | TensorFlow |
| 5 | ResNet50 | Food101 | 75,750 train, 25,250 test | Image Classification | TensorFlow |
| 6 | SmallTransformer | IMDB | 25,000 train, 25,000 test | Text Classification | TensorFlow |
| 7 | Llama 2 7B Q4_0.gguf | N/A | 100 (generate 100 answers) | Text Generation | llama.cpp |
| 8 (bonus) | Geekbench ML | Multiple | Multiple | Multiple | Core ML |

Notes:

  • Only training time was measured, as this generally takes far more time than inference (except for Llama 2 text generation and Geekbench ML, which were inference only). A rough sketch of how per-epoch timing could be measured is shown after these notes.
  • If a result isn't present for a particular machine (in the graphs below), it means it either failed or didn't have enough memory to complete the test (e.g. the M3 14" 2023 with 8GB RAM couldn't run batch size 64 for PyTorch CV Food101).
  • All training was done in float32 because, as far as I know, mixed-precision training isn't available on M-series chips with PyTorch/TensorFlow. If it were, training times would likely be almost halved.
  • All experiments focus on measuring speed only and not accuracy/performance.
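
The full benchmark scripts are on GitHub. As an illustrative sketch only (not the repo's exact code, and with placeholder model/dataloader names), timing a training epoch on Apple Silicon (MPS), NVIDIA (CUDA) or CPU with PyTorch might look like this:

```python
import time

import torch


def pick_device() -> torch.device:
    """Use Apple's Metal (MPS) backend if available, otherwise CUDA, otherwise CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")


def time_one_epoch(model, dataloader, loss_fn, optimizer, device) -> float:
    """Train the model for a single epoch and return the wall-clock time in seconds."""
    model.train()
    start = time.perf_counter()
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)  # copy the batch onto the accelerator
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return time.perf_counter() - start
```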

Results

Each of the following results was recorded in December 2023.

So they may change as the frameworks are updated and better leverage the hardware.

Major versions:

  • PyTorch 2.1.0, TorchVision 0.16.0
  • TensorFlow 2.15.0
  • Transformers 4.35.2

Full results can be viewed in the results/ directory on GitHub.

1 - PyTorch Computer Vision (CIFAR100)

| Model | Dataset | Image Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | CIFAR100 | 32x32x3 | 5 | 50,000 train, 10,000 test | 100 | Image Classification |
Figure 1: Average time per training epoch across 5 epochs with different batch sizes for ResNet50 on CIFAR100 with PyTorch backend.

A trend across almost all experiments (and in practical ML setups) is that you generally get faster performance the more data you pack into the GPU.

In essence, with a lower batch size and a small dataset (CIFAR100 is only 32x32 images), much of the time is spent moving data around rather than computing on it.

From Figure 1 we can see that as the batch size increases, the average time per epoch goes down (until it saturates at 256 and above).
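
One way to think about this: the batch size determines how many optimizer steps (and their fixed per-step overheads, such as moving a batch onto the GPU) happen each epoch. A quick back-of-the-envelope calculation for CIFAR100's 50,000 training images:

```python
import math

num_train_samples = 50_000  # CIFAR100 training set size

for batch_size in (16, 32, 64, 128, 256, 512, 1024):
    steps_per_epoch = math.ceil(num_train_samples / batch_size)
    print(f"batch size {batch_size:>4} -> {steps_per_epoch:>4} steps per epoch")

# batch size   16 -> 3125 steps per epoch
# batch size  128 ->  391 steps per epoch
# batch size 1024 ->   49 steps per epoch
```

At batch size 16 the GPU performs roughly 64x as many small steps as at batch size 1024, so fixed overheads dominate; once the GPU is kept busy (around batch size 256 here, per Figure 1), larger batches stop reducing the time per epoch.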

And then many of the M-series chips level out in terms of time per epoch. This is strange as I thought there would've been more of a difference across different numbers of GPU cores/age of the chips.

The NVIDIA chips have the best performance by a long shot across all batch sizes.

2 - PyTorch Computer Vision (Food101)

| Model | Dataset | Image Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | Food101 | 224x224x3 | 5 | 75,750 train, 25,250 test | 101 | Image Classification |
Figure 2: Average time per training epoch across 5 epochs with different batch sizes for ResNet50 on Food101 with PyTorch backend.

The numbers get closer on a larger dataset.

Food101 image tensors (224x224x3 = 150,528 values) have 49x more elements in them than CIFAR100 image tensors (32x32x3 = 3,072 values).

224x224x3 is also the current image size I use to train the computer vision models that power Nutrify (an app my brother and I built to help people learn about food).

So the Food101 dataset with 100,000+ images is closer to a real world experiment.

It’s clear that when filling up the GPUs with data, the speed gap narrows between the M3 Max (30-core GPU) and the NVIDIA GPUs.

A larger dataset is also where we see the RAM of the M3 (8GB) and M3 Pro (18GB) maxing out (out of memory) with larger batch sizes.

3 - PyTorch Natural Language Processing (NLP)

| Model | Dataset | Sequence Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| DistilBERT (fine-tune top 2 layers + top Transformer block) | IMDB | 512 | 5 | 25,000 train, 25,000 test | 2 | Text Classification |
Figure 3: Fine-tuning the top layers of a DistilBERT model from Hugging Face Transformers on the IMDB dataset (samples per second, higher is better).

DistilBERT is a modern Transformer-based NLP model (a smaller, faster distillation of BERT).

And fine-tuning the last few layers of a network for a specific task is a very common workflow.
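
The actual fine-tuning code is in the repo; as a rough sketch of the general pattern (the checkpoint name and the choice to unfreeze only the top Transformer block are my assumptions for illustration, not necessarily what the benchmark does), freezing most of DistilBERT with Hugging Face Transformers looks roughly like this:

```python
from transformers import AutoModelForSequenceClassification

# Load DistilBERT with a 2-class classification head (IMDB: positive/negative).
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Freeze the whole DistilBERT backbone...
for param in model.distilbert.parameters():
    param.requires_grad = False

# ...then unfreeze just the top Transformer block.
for param in model.distilbert.transformer.layer[-1].parameters():
    param.requires_grad = True

# The pre-classifier and classifier layers stay trainable by default,
# so only the top of the network is updated during fine-tuning.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```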

This test measured samples per second where higher is better.

The results here show that more GPU cores are better.

The M-series chips' results track closely with their number of GPU cores.

Notably, my two-year-old M1 Pro outperformed the brand new M3 Pro.

This is likely due to my M1 Pro having two more GPU cores than the M3 Pro (16 vs 14).

Though no M-series chip was close to the performance of the NVIDIA chips.

Finally, it’s clear that memory (RAM or VRAM) matters when training larger models like DistilBERT, as the M3 failed to complete an epoch with a batch size of 32 or above (out of memory).

For the highest batch sizes, only the machines with the largest amount of RAM were able to complete them. So if you want to train or use larger models, it's clear you'll want more RAM.

4 - TensorFlow Computer Vision (CIFAR100)

| Model | Dataset | Image Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | CIFAR100 | 32x32x3 | 5 | 50,000 train, 10,000 test | 100 | Image Classification |
Figure 4: Average time per training epoch across 5 epochs with different batch sizes for ResNet50 on CIFAR100 with TensorFlow backend.

We see the same trend again with the TensorFlow backend on CIFAR100.

Average time per epoch is greater with lower batch sizes.

Again, likely because with such small batch sizes and data samples, the majority of time is spent moving data around rather than actually computing on it.

Time per epoch steadily decreases as batch size increases.

And we see increasing performance with more GPU cores across the M-series.

Notably again, the M1 Pro keeps pace with the M3 Pro and outperforms the M3 across almost every batch size.
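
For reference, a minimal TensorFlow/Keras setup for this kind of experiment might look like the following. This is a sketch under my own assumptions (no pretrained weights, Adam optimizer, batch size 128); the repo's actual code and hyperparameters may differ.

```python
import tensorflow as tf

# CIFAR100: 50,000 train / 10,000 test images at 32x32x3, 100 classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar100.load_data()
x_train, x_test = x_train.astype("float32") / 255.0, x_test.astype("float32") / 255.0

# ResNet50 backbone trained from scratch, global-average-pooled into a classifier head.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights=None, input_shape=(32, 32, 3), pooling="avg"
)
model = tf.keras.Sequential([backbone, tf.keras.layers.Dense(100, activation="softmax")])

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# On Apple Silicon, installing the tensorflow-metal plugin makes fit() run on the GPU;
# Keras prints the time per epoch as part of the training logs.
model.fit(x_train, y_train, epochs=5, batch_size=128, validation_data=(x_test, y_test))
```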

5 - TensorFlow Computer Vision (Food101)

| Model | Dataset | Image Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet50 | Food101 | 224x224x3 | 5 | 75,750 train, 25,250 test | 101 | Image Classification |
Figure 5: Average time per training epoch across 5 epochs with different batch sizes for ResNet50 on Food101 with TensorFlow backend.

This experiment saw some of the closest results across the board.

With the exception of the M3, most machines were within 2x of each other's performance.

The M1 Pro (16-core GPU) also outperformed the M3 (10-core GPU) and M3 Pro (14-core GPU) across all batch sizes.

The M3 Max (30-core GPU) also closed the gap to the NVIDIA cards.

However, both NVIDIA cards shone when utilising all available cores and memory thanks to the larger data size.

6 - TensorFlow Natural Language Processing (NLP)

| Model | Dataset | Sequence Size | Epochs | Num Samples | Num Classes | Problem Type |
| --- | --- | --- | --- | --- | --- | --- |
| SmallTransformer (custom) | IMDB | 200 | 5 | 25,000 train, 25,000 test | 2 | Text Classification |
Figure 6: Average time per training epoch across 5 epochs with different batch sizes for SmallTransformer on IMDB with TensorFlow backend.

For this experiment I coded up my own SmallTransformer model (a neural network with a single transformer block).
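
The real SmallTransformer lives in the repo; as an approximation only (the embedding size, number of heads and vocabulary size below are made up for illustration), a single-Transformer-block text classifier in Keras could look something like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim, num_heads = 20_000, 200, 64, 2  # illustrative sizes only


class TokenAndPositionEmbedding(layers.Layer):
    """Sum of learned token embeddings and learned position embeddings."""

    def __init__(self, seq_len, vocab_size, embed_dim):
        super().__init__()
        self.token_emb = layers.Embedding(vocab_size, embed_dim)
        self.pos_emb = layers.Embedding(seq_len, embed_dim)

    def call(self, x):
        positions = tf.range(start=0, limit=tf.shape(x)[-1])
        return self.token_emb(x) + self.pos_emb(positions)


inputs = layers.Input(shape=(seq_len,), dtype="int32")
x = TokenAndPositionEmbedding(seq_len, vocab_size, embed_dim)(inputs)

# A single Transformer block: self-attention and a feed-forward layer,
# each with a residual connection and layer normalisation.
attn_out = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(x, x)
x = layers.LayerNormalization()(x + attn_out)
ff_out = layers.Dense(embed_dim, activation="relu")(x)
x = layers.LayerNormalization()(x + ff_out)

# Pool over the sequence and classify (IMDB: positive/negative).
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```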

Figure 6 shows one of the clearest trends amongst the experiments.

M-series chips performed better as they got newer and larger in terms of GPU cores (M3 Max > M3 Pro > M3 > M1 Pro).

And once again, the NVIDIA chips performed far better than the rest of the machines, sometimes 8-9x better.

7 - Llama 2 7B Text Generation

| Model | Task | Num Questions | Num Answers | Total Generations | Backend |
| --- | --- | --- | --- | --- | --- |
| Llama 2 7B (Q4_0 .gguf format) | Text Generation | 20 | 5 | 20*5 = 100 | llama-cpp-python |
Figure 7: Average number of tokens generated per second by a Llama 2 7B model in .gguf format across 100 generation tasks (20 questions, 5 times each) using llama-cpp-python backend.

With new formats like .gguf, LLMs are getting easier and easier to use on device.

And it’s not inconceivable that within the next 12 months (I’m writing this in December 2023) we’ll see ChatGPT-level performance from models running purely on device, no internet connection required.

One of the main metrics to measure for current LLMs is tokens generated per second (as in, how many words per second can the model output when you ask it a question).

In this case, higher is better.
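
As a rough sketch of how this kind of measurement can be taken with llama-cpp-python (the model path and prompt are placeholders, and the repo's benchmark script may differ, e.g. by timing generation separately from prompt processing):

```python
import time

from llama_cpp import Llama

# Load the 4-bit quantised Llama 2 7B chat model and offload all layers to the GPU
# (Metal on Apple Silicon, CUDA on NVIDIA).
llm = Llama(model_path="llama-2-7b-chat.Q4_0.gguf", n_gpu_layers=-1)

prompt = "Q: Name three things you could cook with apples. A:"

start = time.perf_counter()
output = llm(prompt, max_tokens=200)
elapsed = time.perf_counter() - start

# Note: elapsed includes prompt processing, so this is a slightly pessimistic rate.
generated_tokens = output["usage"]["completion_tokens"]
print(f"~{generated_tokens / elapsed:.1f} tokens per second")
```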

And here the trend was again, more GPU cores means higher performance.

The M1 Pro outperformed the M3 and M3 Pro but the M3 Max with 30 GPU cores pulled out in front.

However, something to note is that for me, anything over 25-30 tokens per second is above reading speed.

So whilst higher tokens per second is of course better, it’s important to note that 35 tokens per second (M1 Pro) and ~48 tokens per second (M3 Max) are both definitely more than usable from a practical standpoint.

8 - Geekbench ML

I've seen Geekbench scores for various chips over the years.

But this is the first time I've tried Geekbench ML.

So I was excited to see the outcomes.

All tests were done using Geekbench ML 0.6.0 for Mac.

Tests include a series of inference-only benchmarks across different domains.

All machines have a 16-core Neural Engine.

| Machine | CPU Cores | CPU Score | GPU Cores | GPU Score | Neural Engine Score |
| --- | --- | --- | --- | --- | --- |
| MacBook Pro M1 Pro 14 inch, 2021 | 10 | 1809 | 16 | 5192 | 6462 |
| MacBook Pro M3 14 inch, 2023 | 8 | 2356 | 10 | 5747 | 8399 |
| MacBook Pro M3 Pro 14 inch, 2023 | 11 | 2355 | 14 | 7030 | 10237 |
| MacBook Pro M3 Max 14 inch, 2023 | 14 | 2393 | 30 | 9008 | 9450 |

Notably, the M3 outperforms the M1 Pro in the Geekbench ML scores; however, in practice, it seems the M1 Pro can perform on par with or even outperform the M3.

As for the Neural Engine, I'm not 100% sure why the M3 Pro scores higher than the M3 Max. I tested these two several times and recorded the highest score for each.

I guess in practice, you won't notice this difference though.

As far as I know, the neural engine kicks in for inference tasks and there is still a bit of black magic behind how and when it actually works.

Discussion

It's quite clear that the newest M3 Macs are quite capable of machine learning tasks.

However, dedicated NVIDIA GPUs still have a clear lead.

The results also show that more GPU cores and more RAM equates to better performance (e.g. M3 Max outperforming most other Macs on most batch sizes).

An interesting result was that the M3 base chip outperformed (or performed level with) the M3 Pro and M3 Max on smaller-scale experiments (CIFAR100, smaller batch sizes).

I'm not 100% sure why this is the case but my intuition tells me this is likely because the overhead of copying data to and from the GPU is more expensive than the actual training itself (e.g. the GPU is waiting for data to be copied to it, rather than being fully utilised).

So in practice, the M3 can compete with the M3 Pro and M3 Max on these smaller workloads because the actual computation doesn't take long but the copying does.

Either way, the Food101 examples show a more realistic example with larger image sizes. It's here that the machines with more GPU cores perform faster and the machines with more RAM can handle larger batch sizes.

For the best results, you'll want to always pack as much data into the GPU as possible (to use all of your GPU cores) and avoid unnecessary copying of data back and forth between memory and the GPU.

I thought that the unified memory system on the M-series chips would reduce copying overheads. Perhaps this is not yet the case from a software perspective (e.g. PyTorch and TensorFlow were not originally designed for Apple Silicon).

Maybe newer frameworks designed for Apple Silicon such as MLX will better utilise the unified memory system. This will require further investigation.

I was also very impressed by the performance of the M1 Pro I bought 2 years ago (with upgrades). It was able to outperform the base M3 and could keep up with or perform better than the M3 Pro on almost all benchmarks.

It seems Apple may have gone too hard on the M1 chips, as they are still performing outstandingly well two years later.

Finally, all experiments were conducted in float32. And as far as I know, float16 (half-precision) training isn't yet possible on the M-series chips with TensorFlow/PyTorch. Training in float16 would definitely see the NVIDIA GPUs pull even further ahead (and I'd assume a similar speedup for the Apple Silicon Macs once it becomes available).

The Geekbench ML results were as expected (newer and bigger chips doing better) with the exception of the M3 Max performing slightly worse on the Neural Engine than the M3 Pro. However, I'd take this number with a grain of salt as it will likely be close to unnoticed in real-world applications.

Recommendations

For smaller experiments, fine-tuning models and learning the fundamentals of machine learning, the M3 Macs will be more than fine to use.

But for larger scale workloads, you'll likely still want a dedicated NVIDIA GPU.

Personally, I use my M1 MacBook Pro as a daily driver but perform all larger-scale deep learning experiments on my NVIDIA GPU PC (connected via SSH).

For example, I do plenty of data exploration for Nutrify but all model training happens on an NVIDIA TITAN RTX.

And Google Colab helps to fill in the gaps whenever necessary.

Based on the results across the new M3 Macs, I'm not going to upgrade my M1 MacBook Pro.

But I am curious to see how a spec'd up M3 Max (or future M3 Ultra) would go with a dedicated MLX model against my NVIDIA GPU PC.

In summary my recommendations are:

  • Go for as much RAM and as many GPU cores as you can afford, typically in that order.
    • More GPU cores = faster training/inference.
    • More RAM = larger batch sizes/models.
  • Avoid the 8GB RAM M3; 16GB is a good minimum.
    • In terms of value for money, the M3 Pro with a RAM upgrade (18GB -> 36GB) and GPU upgrade (14 cores -> 18 cores) still comes in cheaper than an M3 Max.
  • If you've got the option, perhaps spend less on a MacBook and buy a dedicated NVIDIA GPU to set up a deep learning PC you can SSH into (this is what I do).
    • For example, get the baseline M3 with a RAM upgrade and spend the rest of the money on an NVIDIA GPU.

See the GitHub repo for more notes on the experiments and the video walkthrough for a more visual overview.