• 1 Post
  • 135 Comments
Joined 1 year ago
Cake day: March 22nd, 2024

  • Yeah. But it also breaks things relative to the llama.cpp baseline, hides or doesn’t support some features/optimizations, and definitely doesn’t support the more efficient iq_k quants from ik_llama.cpp or its specialized MoE offloading.

    And that’s not even getting into the various controversies around ollama (like broken GGUFs or indications they’re going closed source in some form).

    …It just depends on how much performance you want to squeeze out, and how much time you want to spend on the endeavor. Small LLMs are kinda marginal though, so IMO that squeezing is important if you really want to try; otherwise one is probably better off spending a few bucks on an API that doesn’t log requests.
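    To illustrate the kind of MoE offloading I mean, here’s a minimal launch sketch. It assumes the llama-server binary from llama.cpp/ik_llama.cpp and its --override-tensor (-ot) option for pinning expert tensors to CPU RAM; the model filename, quant, and numeric values are placeholders, so check the ik_llama.cpp README for the exact switches:

    ```python
    # Hypothetical launch wrapper for llama-server (llama.cpp / ik_llama.cpp).
    # The model path and numeric values are placeholders, not recommendations.
    import subprocess

    cmd = [
        "./llama-server",
        "-m", "model-IQ4_K.gguf",   # placeholder: an iq_k-quantized MoE model
        "-ngl", "99",               # offload every layer that fits onto the GPU...
        "-ot", "exps=CPU",          # ...but keep the huge MoE expert tensors in system RAM
        "-c", "16384",              # context length
        "--host", "127.0.0.1",
        "--port", "8080",
    ]
    subprocess.run(cmd, check=True)  # blocks while the server runs
    ```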




  • At risk of getting more technical, ik_llama.cpp has a good built-in webui:

    https://github.com/ikawrakow/ik_llama.cpp/

    Getting more technical, it’s also way better than ollama. You can run way smarter models than ollama can on the same hardware.

    For reference, I’m running GLM-4 (667 GB of raw weights) on a single RTX 3090/Ryzen gaming rig, at reading speed, with pretty low quantization distortion.

    And if you want a ‘look this up on the internet for me’ assistant (which you need for them to be truly useful), you need another docker project as well.

    …That’s just how LLM self hosting is now. It’s simply too hardware-intensive and ad hoc to be easy, smart, and cheap all at once. You can indeed host a small ‘default’ LLM without much tinkering, but it’s going to be pretty dumb, and pretty slow, on ollama defaults.
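    For reference, the bundled server (like mainline llama.cpp’s llama-server) also exposes an OpenAI-compatible API alongside the webui, so any standard client can talk to it. A minimal sketch, assuming it’s listening on localhost:8080; the port and model name are placeholders for your own setup:

    ```python
    # Minimal chat request against a locally hosted llama-server.
    # The base_url/port and model name are placeholders, not fixed values.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="local",  # llama-server generally serves whatever model it was launched with
        messages=[{"role": "user", "content": "Give me a one-line summary of MoE offloading."}],
    )
    print(resp.choices[0].message.content)
    ```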




  • Oh wow, that’s awesome! I didn’t know folks ran TDP tests like this, just that my old 3090 seems to have a minimum sweet spot around that same ~200W based on my own testing, but I figured the 4000 or 5000 series might go lower. Apparently not, at least for the big die.

    I also figured the 395 would draw more than 55W! That’s also awesome! I suspect newer, smaller GPUs like the 9000 or 5000 series still make the value proposition questionable, but you still make an excellent point.

    And for reference, I just checked, and my dGPU hovers around 30W idle with no display connected.
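    If anyone wants to reproduce that idle number, here’s a quick sketch using NVIDIA’s NVML bindings (the nvidia-ml-py package, imported as pynvml); it reads the same power figure nvidia-smi reports, and assumes the card of interest is GPU index 0:

    ```python
    # Read the current GPU power draw via NVML (the figure nvidia-smi reports).
    # Assumes the nvidia-ml-py package and that the card of interest is index 0.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    milliwatts = pynvml.nvmlDeviceGetPowerUsage(handle)  # returned in milliwatts
    print(f"GPU 0 power draw: {milliwatts / 1000:.1f} W")
    pynvml.nvmlShutdown()
    ```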


  • brucethemoose@lemmy.world to Selfhosted@lemmy.world · 1U mini PC for AI? (edited · 16 days ago)

    Eh, but you’d be way better off with an X3D CPU in that scenario: it’s significantly faster in games, about as fast outside them (unless you’re DRAM bandwidth limited), and more power efficient (because those chips clock relatively low).

    You’re right about the 395 being a fine HTPC machine by itself.

    But I’m also saying even an older 7900, 4090 or whatever would draw way less power at the same performance as the 395’s IGP, and be whisper quiet in comparison, even if cost is no object. And if that’s the case, why keep a big IGP at all? It just doesn’t make sense to pair them without some weirdly specific use case that can use both at once, or one that a discrete GPU literally can’t handle because it doesn’t have as much VRAM as the 395.


  • brucethemoose@lemmy.world to Selfhosted@lemmy.world · 1U mini PC for AI? (edited · 16 days ago)

    Eh, actually that’s not what I had in mind:

    • Discrete desktop GPUs idle hot. I think my 3090 draws at least 40W doing literally nothing.

    • It’s always better to run big dies slowly than small dies at high clockspeeds. In other words, if you underclocked a big desktop GPU to 1/2 its peak clockspeed, it would use less than a fourth of the energy (rough math below) and be basically inaudible… and still be faster than the iGPU. So why keep a big iGPU around?
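    To make that fraction concrete: dynamic power scales roughly with frequency times voltage squared, so halving the clock plus the voltage drop a lower clock allows compounds quickly. A back-of-the-envelope sketch (the voltage figures are illustrative, not measured on any particular card):

    ```python
    # Rough dynamic-power scaling: P ~ C * V^2 * f, so the capacitance term cancels in a ratio.
    # The voltages below are illustrative placeholders, not measured values.
    def relative_power(v_ratio: float, f_ratio: float) -> float:
        return (v_ratio ** 2) * f_ratio

    # Halve the clock and drop core voltage from ~1.05 V to ~0.70 V:
    print(f"{relative_power(v_ratio=0.70 / 1.05, f_ratio=0.5):.2f}x")  # ~0.22x, i.e. under a fourth
    ```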

    My use case was multitasking and compute stuff. E.g. game on / use the discrete GPU while your IGP churns away on something else. Or combine them in some workloads.

    Even the 395 by itself doesn’t make a ton of sense for an HTPC, because AMD slaps so much CPU on it; that makes it way too expensive and power thirsty. A single CCD (8 cores instead of 16) + the full integrated GPU would be perfect and lower power, but AMD inexplicably does not offer that.

    Also, I’ll add that my 3090 is basically inaudible next to a TV… the key is to cap its clocks, and then the fans barely even spin up.
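    For anyone curious how to cap clocks like that, the usual route on NVIDIA cards is nvidia-smi’s power-limit and lock-gpu-clocks options (root required). A sketch; the 200W limit and clock range are placeholders to tune per card, not recommendations:

    ```python
    # Cap a GPU's power limit and clock range via nvidia-smi (requires root).
    # The device index, 200 W limit, and 210-1400 MHz range are illustrative placeholders.
    import subprocess

    subprocess.run(["nvidia-smi", "-i", "0", "-pl", "200"], check=True)        # power limit in watts
    subprocess.run(["nvidia-smi", "-i", "0", "-lgc", "210,1400"], check=True)  # lock GPU clocks (MHz)
    ```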



  • brucethemoose@lemmy.world to Selfhosted@lemmy.world · 1U mini PC for AI? (edited · 18 days ago)

    It’s PCIe 4.0 :(

    > but these laptop chips are pretty constrained lanes-wise

    Indeed. I read Strix Halo only has 16 PCIe 4.0 lanes in addition to its USB4, which is reasonable given it isn’t supposed to be paired with discrete graphics. But I’d happily trade an NVMe slot (still leaving one) for an x8 slot.
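    For rough numbers on what that trade buys: PCIe 4.0 runs 16 GT/s per lane with 128b/130b encoding, i.e. roughly 2 GB/s per lane each way, so x8 doubles the GPU link from about 8 to about 16 GB/s. A quick sketch:

    ```python
    # PCIe 4.0: 16 GT/s per lane, 128b/130b encoding -> ~1.97 GB/s per lane, per direction.
    def pcie4_bandwidth_gb_s(lanes: int) -> float:
        return lanes * 16 * (128 / 130) / 8

    for lanes in (4, 8, 16):
        print(f"x{lanes}: ~{pcie4_bandwidth_gb_s(lanes):.1f} GB/s")
    # x4: ~7.9 GB/s, x8: ~15.8 GB/s, x16: ~31.5 GB/s
    ```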

    One of the links to a CCD could theoretically be wired to a GPU instead, right? Kinda like how EPYC can switch its IO lanes between Infinity Fabric for 2P servers and extra PCIe in 1P configurations. But I doubt we’ll ever see such a product.



  • brucethemoose@lemmy.world to Selfhosted@lemmy.world · 1U mini PC for AI? (edited · 18 days ago)

    If you can swing $2K, get one of the new mini PCs with an AMD 395 and 64GB+ RAM (ideally 128GB).

    They’re tiny, low power, and the absolute best way to run the new MoEs like Qwen3 or GLM Air for coding. TBH they would blow a 5060 Ti out of the water, as having a ~100GB VRAM pool is a total game changer.
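    To put that ~100GB pool in perspective, the rough GGUF sizing arithmetic is just total parameters times bits per weight. The parameter counts below are approximate, and the bits-per-weight figures are typical quant levels rather than exact file sizes:

    ```python
    # Rough GGUF size estimate: params (billions) * bits per weight / 8, ignoring KV cache and overhead.
    def approx_size_gb(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * bits_per_weight / 8

    # Approximate totals: GLM-4.5-Air ~106B params, Qwen3-235B-A22B ~235B params.
    print(f"GLM Air    @ ~4.5 bpw: ~{approx_size_gb(106, 4.5):.0f} GB")  # ~60 GB
    print(f"Qwen3 235B @ ~3 bpw:   ~{approx_size_gb(235, 3.0):.0f} GB")  # ~88 GB
    ```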

    I would kill for one on an ITX mobo with an x8 slot.


  • I dunno about Linux, but on Windows I used to use something called K10stat to manually undervolt the cores when there was no way to do it via the BIOS. The difference was night-and-day dramatic, as they idled ridiculously fast and AMD left a ton of voltage headroom back then.

    I bet there’s some Linux software to do it. Look up if anyone used voltage control software for desktop Phenom IIs and such.