Thanks! Hadn’t thought of YouTube at all, but it’s super helpful. I guess that’ll help me decide whether the extra RAM is worth it, considering that inference will be much slower if I don’t go NVIDIA.
Yeah, I was thinking about running something like Code Qwen 72B, which apparently requires 145 GB of RAM for the full model. But if it’s super slow, especially with large context, and I can only run small models at acceptable speed anyway, it may be worth going NVIDIA just for CUDA.
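(For reference, that figure tracks with the back-of-envelope math: FP16 weights take about 2 bytes per parameter, so 72B × 2 B ≈ 144 GB before you add the KV cache. A 4-bit quant of the same model would land somewhere around 36–45 GB, which is why quantized builds fit on much smaller machines.)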
Meh, ofc I don’t.
Thanks, that’s very helpful! I’ll look into that type of build.
I understand what you’re saying, but I’m coming to this community because I like getting more input, hearing about others’ experiences, and potentially learning about things I didn’t know about. I wouldn’t ask in this community specifically if I didn’t want to optimize my setup as much as I can.
Interesting! Is there any kind of model you could run at a reasonable speed?
I guess it could amortize over time, but if the usability sucks, that may make it not worth it. OTOH, I really don’t want to send my data to any company.
I’d honestly be open to that, but wouldn’t an AMD setup take up a lot of space and consume lots of power / be loud?
It seems like, in terms of price and speed, the Macs suck compared to other options, but if you don’t have a lot of space and don’t want to hear an airplane engine running constantly, I’m wondering what the options are.
Yeah, the unified memory of the Mac M series is very attractive for running models at full context length, and the memory bandwidth is quite good for token generation, compared to the price, power consumption, and heat output of NVIDIA GPUs.
Since I’ll have to put this in my kitchen/living room, that’d be a big plus, but idk how well prompt processing would work if I send over something like 80k tokens.
So would this work well, e.g., with the *arr stack? Because most of the services wouldn’t even need to run all the time.
I’m intrigued! But how does it compare to React, which is pretty straightforward? I’m not a frontend dev, so what’s really great about React is that it works super well with LLMs.
Thanks! Super helpful, and I’d love to have the compose and install script. I also looked into the Helm charts, but I’m still wondering whether I should eventually go down that route.
Thanks! What about CPU usage? How many CPUs did you assign to the environment the container runs in?
Thanks! What resources are you running it on? I’m looking into a VPS that could host it, and ChatGPT recommends 4-8 vCPUs and 16 GB of RAM, which sounds reasonable. But let’s say I’m running it on k8s: does that leave any room for, e.g., running other services on the same cluster?
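On k8s you can make that explicit with resource requests and limits, so the scheduler knows how much headroom is left for everything else. A rough sketch, assuming the service runs as a Deployment; the name, image, and numbers are all placeholders, not a sizing recommendation:

```yaml
# Hypothetical Deployment fragment: cap this service so the rest of
# the cluster keeps headroom for other workloads.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      containers:
        - name: my-service
          image: example/my-service:latest
          resources:
            requests:         # what the scheduler reserves for this pod
              cpu: "2"
              memory: 8Gi
            limits:           # hard ceiling for this container
              cpu: "4"
              memory: 12Gi
```

With requests like that, an 8 vCPU / 16 GB node would still have several cores and a few GB of RAM free for other services; the scheduler only counts the requests, so whatever you don’t reserve stays schedulable.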
Thank you! I’m running a Servarr stack over Docker Compose and have managed some Kubernetes clusters in the past (although poorly, tbh). Any idea how complicated it is in comparison? Also, do you use their Helm charts?
I would deploy the whole app via k8s Helm charts, use the CI/CD tools, and set up Traefik/Ingress for load balancing with Cloudflare pointing at it. In the future I might be collaborating with other people, so I’d want the architecture to be solid.
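For the Traefik part, the usual pattern is a standard Ingress that Cloudflare’s DNS points at. A minimal sketch, assuming Traefik is installed as the ingress controller with its usual `websecure` entrypoint; the hostname and service name are placeholders:

```yaml
# Hypothetical Ingress routed by Traefik; Cloudflare DNS would point
# app.example.com at the cluster's external IP / load balancer.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    # tell Traefik to serve this route on its TLS entrypoint
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
spec:
  ingressClassName: traefik
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app        # Service created by the Helm chart
                port:
                  number: 80
```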
I would actually want to use it to integrate with k8s: I’d deploy the app on Kubernetes and do load balancing plus point a Cloudflare domain at it, so I’d need the whole thing to be solid. I think I do need a lot of the features, but I don’t necessarily need GitLab specifically if something FOSS could offer the same.
Thanks! May I ask what kind of setup you were running, and whether there’s any feature you miss that existed in GitLab but doesn’t exist in Forgejo?
Thanks! This actually looks really interesting. Did you try doing CI/CD with it? In the future I’d probably collaborate with others who’d also be using my self-hosted Git. What would be critical for me is being able to set it up so that when I open a PR, that branch automatically gets deployed to a dev Kubernetes environment; when I merge into main, it automatically deploys to staging; and only when I release a tag does it end up in prod. I’d also like to do secrets management through the platform. I like that Forgejo is non-commercial, and I’d prefer it over GitLab if it can do these things well.
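For what it’s worth, Forgejo Actions uses GitHub-Actions-compatible workflow syntax, so that flow could look roughly like the sketch below. This is only an illustration: the chart path, release names, namespace mapping, and the `KUBECONFIG` secret are my assumptions, not anything Forgejo prescribes, and the runner label depends on your runner setup.

```yaml
# .forgejo/workflows/deploy.yml (hypothetical)
# PR -> dev, push to main -> staging, released tag -> prod.
name: deploy
on:
  pull_request:              # every PR gets deployed to the dev cluster
  push:
    branches: [main]         # merges to main go to staging
    tags: ["v*"]             # release tags go to prod

jobs:
  deploy:
    runs-on: docker          # label depends on how your runner is registered
    steps:
      - uses: actions/checkout@v4
      - name: Pick target environment
        id: env
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            echo "target=dev" >> "$GITHUB_OUTPUT"
          elif [ "${{ github.ref_type }}" = "tag" ]; then
            echo "target=prod" >> "$GITHUB_OUTPUT"
          else
            echo "target=staging" >> "$GITHUB_OUTPUT"
          fi
      - name: Helm deploy
        env:
          KUBECONFIG_DATA: ${{ secrets.KUBECONFIG }}  # stored as a repo/org secret
        run: |
          echo "$KUBECONFIG_DATA" > kubeconfig
          KUBECONFIG=./kubeconfig helm upgrade --install \
            myapp-${{ steps.env.outputs.target }} ./chart \
            --namespace "${{ steps.env.outputs.target }}" --create-namespace
```

Secrets (like that kubeconfig) live in the repo or org settings and get injected via the `secrets` context, which covers the secrets-management part of your requirement.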
Yeah, I found some stats now, and indeed you’re going to wait something like an hour for processing if you throw 80-100k tokens into a powerful model. With APIs that kind of works instantly. Not surprising, but just to give a comparison. Bummer.
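(The arithmetic backs that up: at, say, ~25 tokens/s of prompt processing, 90k tokens take roughly an hour, while hardware that chews through prompts at thousands of tokens/s finishes the same input in well under a minute.)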