🌬️ Au-gust of wind has brought you another Climbuddy update! 🌬️

This month has been devoted entirely to nailing down the hold detection pipeline, which is arguably the most important part of the whole process – without it, nothing else works. Most of that time went into running benchmarks to find the right configuration for our use case.

So, without further ado, let’s get into it.

App

All of our focus is on the model generation pipeline, so not much has transpired here. 🏜️

One thing we did do is decide what the next update will focus on – the admin UI, which is currently entirely broken. We've been concentrating on the user experience while neglecting the admin interface (the only instance is currently managed by us), and that needs to change.

The release date is currently set for October 2024 – we'll see if we manage to make it in time.

Tech

Hardware Upgrade

We got new toys!

Previously, everything ML-related (including training) was done on an NVIDIA RTX 3060 Ti. This was okay for smaller models, but made using and training larger models very difficult with only 8 GB of VRAM.

For that reason, we chose to get an NVIDIA RTX 4090 (24 GB of VRAM), and the difference is night and day; all of the benchmarks seen below are courtesy of this new setup 🎉.

Getting SAM results

After a decent amount of messing around (as seen in the past two posts), we have a concrete answer to whether SAM is a useful tool for improving the quality of our datasets.

Short answer: YES, very much so.

The long answer is that it is, as long as you use the right prompts – since SAM can be prompted with any combination of positive/negative points, bounding boxes and low-res segmentation masks, it's not immediately obvious which combination produces the best results.

Furthermore, since there are many versions of SAM, both in terms of weights (SAM-b / SAM-l) and implementations (SAM 2 came out a few weeks ago!), we also have to make sure we pick the best one.
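For reference, here is roughly what this prompting looks like with the original segment-anything API (just a sketch – the checkpoint path, image and prompt coordinates below are made up, and SAM 2 uses a slightly different interface):

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM-l (Meta's ViT-L checkpoint).
sam = sam_model_registry["vit_l"](checkpoint="sam_vit_l_0b3195.pth")
predictor = SamPredictor(sam)

# SAM expects RGB images.
image = cv2.cvtColor(cv2.imread("wall.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Any subset of these prompts can be combined (or left out entirely):
box = np.array([100, 150, 220, 260])    # hold bounding box, (x1, y1, x2, y2)
points = np.array([[160, 205]])         # a positive point inside the hold
labels = np.array([1])                  # 1 = positive point, 0 = negative point
# a low-res (1, 256, 256) logit mask could also be passed via mask_input=...

masks, scores, _ = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    multimask_output=False,  # one refined mask per hold
)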

Here are the results that we got on our Crimp dataset, testing all combinations of points/masks/boxes on MobileSAM, SAM-b, SAM-l, SAM2-b, SAM2-l:

Model + prompts                          | IoU with GT
base (projection from 3D model vs. GT)   | 85.3%
SAM-l with bbox + points                 | 96.2%
SAM-l with points + bbox + mask          | 96.3%
SAM-l with bbox + mask                   | 96.5%
SAM-l with bbox                          | 96.5%
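The IoU column is plain intersection-over-union between each refined mask and its ground-truth mask; as a minimal sketch, assuming both are boolean arrays of the same shape:

import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    # Intersection over union of two boolean masks of the same shape.
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(intersection / union) if union > 0 else 1.0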

The best configuration turns out to be SAM-l with bounding-box-only prompts, with the other methods and models trailing close behind – which is a bit surprising, since SAM 2 should, on paper, perform better.

We may revisit this in the future when we gather more data, but the results are good enough for now 🙂.

YOLOptimization

With a nicely SAM-improved dataset in hand, we can tackle the hold detection training. We are using YOLOv8 and YOLOv9, because they offer state-of-the-art image segmentation, have a copyleft license and are very simple to use.

Pre-trained vs. Ground-up

This one was pretty easy to answer, but still a good sanity check – should we use the pre-trained weights, or start training from scratch?
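With Ultralytics this is essentially a one-line difference: loading the .pt weights starts from the COCO-pretrained checkpoint, while loading the .yaml config builds the same architecture with random weights. A rough sketch of what the two runs look like (the dataset config name and epoch count are illustrative, and we're assuming the Ultralytics YOLOv9 segmentation assets):

from ultralytics import YOLO

# Pre-trained: start from the COCO segmentation checkpoint.
pretrained = YOLO("yolov9c-seg.pt")

# Ground-up: same architecture, randomly initialized weights.
from_scratch = YOLO("yolov9c-seg.yaml")

for model in (pretrained, from_scratch):
    model.train(
        data="holds.yaml",  # our hold dataset config (illustrative name)
        epochs=100,         # illustrative
        batch=-1,           # autobatch: pick the largest batch size that fits in VRAM
    )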

[Figure: mask precision and recall for yolov9-c and yolov9-e trained with autobatch. The gap separates the pre-trained models (top) from the ground-up models (bottom).]

Unsurprisingly, the pre-trained models work much better, likely because our dataset (~100 pictures) is simply too small to train a model from scratch. We can always return to this benchmark once we gather more data.

YOLO version and network size

Next question: how many parameters do we need?

YOLO has a number of versions, each of which comes in a number of configurations with varying parameter counts. On paper, more parameters correspond to better performance, but our segmentation problem is on the easy side (we’re essentially looking for “colored blob”ish things), so we might get away with using smaller networks.
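To put rough numbers on that, the parameter counts of the candidates can be pulled straight from Ultralytics (a quick sketch; the exact figures are the ones quoted in the caption below):

from ultralytics import YOLO

# Candidate segmentation models, roughly smallest to largest.
for name in ("yolov8m-seg.pt", "yolov9c-seg.pt", "yolov9e-seg.pt", "yolov8x-seg.pt"):
    YOLO(name).info()  # prints layer count, parameter count and GFLOPs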

[Figure: mask precision and recall for yolov8-m (27.3M params), yolov8-x (71.8M params), yolov9-c (27.9M params) and yolov9-e (60.5M params), trained with batch size 4.]

Looking at the results, the models seem to perform (very) roughly the same. However, the larger ones consume significantly more memory, and have to be trained on small batch sizes (~4) to fit on the GPU. This means that increasing the image size (the subject of the next benchmark) is impossible without exceeding the available VRAM, so we will opt for the smaller yolov8-m model for now.

We can return to this benchmark on more powerful hardware in the future to see if it’s worth training the larger models in the cloud, but for now we’ll work with what we have.

Image size

For our next benchmark, we’ll examine how image size relates to the quality of the results.

Taking a few image size/batch size configurations that fit into the available VRAM (640x640 @ batch 32, 960x960 @ batch 16, 1024x1024 @ batch 14, 1280x1280 @ batch 8), here are the results:

[Figure: mask precision and recall for yolov8-m at image sizes 1280 (top three runs) and 1024 (bottom three runs).]

As you can see, the results are pretty conclusive: image size matters a lot, with larger images producing noticeably better masks. This was somewhat surprising, since we thought that 640x640 (the default image size) would be sufficient, but it turns out that a lot of holds are small enough to be missed at lower resolutions, so up we go.

Note that the graphs only show the top two sizes, because the others were so much worse that the tuning algorithm didn’t finish running all the epochs.
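The sweep itself is nothing fancy – essentially the same training call repeated over the (image size, batch size) pairs that fit into VRAM. A sketch of what it looks like on our side (dataset config and run names are illustrative):

from ultralytics import YOLO

# (image size, batch size) pairs that still fit into 24 GB of VRAM.
configs = [(640, 32), (960, 16), (1024, 14), (1280, 8)]

for imgsz, batch in configs:
    model = YOLO("yolov8m-seg.pt")
    model.train(
        data="holds.yaml",            # illustrative dataset config name
        imgsz=imgsz,
        batch=batch,
        epochs=100,                   # illustrative
        name=f"yolov8m-seg-{imgsz}",  # separate run directory per configuration
    )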


There are many more benchmarks we need to run to ensure Climbuddy works well for all types of gyms and holds – finding the right hyperparameters for training, using SAM as a post-processor for the YOLO network, and more – but the ones we’ve run so far are a sign that we’re on the right track.

Stay tuned (and hopefully so will our models)! 🙂

Team Climbuddy