

To be clear, VMs absolutely have overhead; the open question is Docker/Podman, where it might be negligible.
And this is a particularly weird scenario (since prompt processing literally has to shuffle ~112GB over the PCIe bus for each batch). Most GPGPU apps aren’t so sensitive to transfer speed/latency.
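
For a rough sense of scale, here's a back-of-the-envelope sketch of how long moving ~112GB takes at different link speeds. The bandwidth numbers are approximate theoretical one-direction peaks for an x16 slot, and the link generations are my assumption, not something measured on this setup. Real sustained throughput will be lower, so treat these as lower bounds on transfer time.

```python
# Rough lower bound on per-batch transfer time over PCIe.
# Real sustained throughput is below these theoretical peaks.

PAYLOAD_GB = 112  # approximate per-batch figure from the comment above

# Approximate theoretical one-direction x16 bandwidth in GB/s.
# Which generation applies here is an assumption for illustration.
PCIE_X16_GB_PER_S = {
    "PCIe 3.0": 15.75,
    "PCIe 4.0": 31.5,
    "PCIe 5.0": 63.0,
}


def transfer_seconds(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time in seconds to move payload_gb at a sustained bandwidth."""
    return payload_gb / bandwidth_gb_per_s


for gen, bandwidth in PCIE_X16_GB_PER_S.items():
    seconds = transfer_seconds(PAYLOAD_GB, bandwidth)
    print(f"{gen} x16: ~{seconds:.1f} s per batch")
```

So even at the theoretical peak it's seconds per batch spent purely on the bus, which is why small losses in transfer efficiency would show up here and not in most other GPGPU workloads.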


