• 0 Posts
  • 7 Comments
Joined 7 months ago
cake
Cake day: June 29th, 2024

help-circle
  • I don’t think ‘cattle not pets’ is all that corporate, especially w/r/t death of the author. For me, it’s more about making sure that failure modes have (rehearsed) plans of action, and being cognizant of any manual/unreplicable “hand-feeding” that you’re doing. Random and unexpected hardware death should be part of your system’s lifecycle, and not something to spend time worrying about. This is also basically how ZFS was designed from a core level, with its immense distrust for hardware allowing you to connect whatever junky parts you want and letting ZFS catch drives that are lying/dying. In the original example, uptime seems to be an emphasized tenet, but I don’t think it’s the most important part.

    RE replacements on scheduled time, that might be true for RAIDZ1, but IMO a big selling point of RAIDZ2 is that you’re not in a huge rush to get resilvering done. I keep a cold drive around anyway.


  • “Cattle not pets” in this instance means you have a specific plan for the random death of a HDD (which RAIDZ2 basically already handles), and because of that you can work your HDDs until they are completely dead. If your NAS is a “pet” then your strategy is more along the lines of taking extra-good care of your system (e.g. rotating HDDs out when you think they’re getting too old, not putting too much stress on them) and praying that nothing unexpected happens. I’d argue it’s not really “okay” to have pets just because you’re in a homelab, as you don’t really have to put too much effort into changing your setup to be more cynical instead of optimistic, and it can even save you money since you don’t need to worry about keeping things fresh and new.

    “In the old way of doing things, we treat our servers like pets, for example Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.”

    ~from https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/




  • I recommend a dead man’s switch like Healthchecks.io, which can be selfhosted for free. Whenever you have something that’s regularly occurring, add an extra callout to your unique Healthchecks callout UUID as part of the automation, and Healthchecks will send you a notification if something misses its callout schedule. You can also attach whatever data (e.g. a log) to the callout so you can look back through the run history. IIRC Borg will give you a non-zero return code if it detects problems, so you can send e.g. https://hc-ping.com/your-uuid-here/$? and a non-zero code will signal a notification as well (more examples here).



  • I used Proxmox for a couple years and it’s good if you run a lot of VMs or LXCs, but I found that I’m not really the target audience. I ended up only running one Debian VM for my Docker containers. It was fine, but I eventually felt that Proxmox added no value for me, and the end result was sacrificing some memory and performance from using virtio emulations for CPU/GPU/RAM/filesystems. If your machines only have 8-16GB of RAM I don’t think it would be a good idea, as I’ve seen the rule of thumb is to dedicate 2GB for Proxmox’s usage, which is in addition to any guest OS’s requirements. Meanwhile I have a Debian install on a VPS that takes about 450MB of RAM.

    For me, pros:

    • Native ZFS support - invaluable, ZFS is terrific. MergerFS+SnapRAID is a decent replacement but the dodgy tooling and laundry list of footguns makes me nervous to use it on important data. ZFS is idiot-proof, as long as you know what you’re doing during the initial setup. RAIDZ expansion is coming this year and you can still use mixed-size disks in a RAIDZ as long as you accept that all disks are equivalent to the smallest one, so I personally feel ZFS is acceptable for grab-bag disk usage now
    • Separation of bare metal and server environment, which means you can spin up another server VM from scratch without impacting the previous one, then switch with zero downtime. In the end, I replaced Proxmox with Debian on ZFS root (ZFSBootMenu) and wrote a few hundred lines of bash to automate the installation, so when I switched it only took about 30 minutes of downtime start to finish.
    • Isolation of different environments. If my VM gets hacked, it will have a harder time reaching my Proxmox host etc. I run all services in isolated Docker environments anyway so this isn’t that big of a perk for my threat profile.

    Cons:

    • Partitioning RAM for ZFS ARC, Proxmox, and VM leads to inherent inefficiencies at the margins.
    • I usually give my VM n-1 CPU cores, which is still less power than if I had just used the CPU natively.
    • GPU passthroughs to VM can be less efficient, depending on the GPU and how it handles it. My iGPU is less performant when using its ~SR-IOV feature
    • Learning requirement - not a huge learning curve but it’s a lot of knowledge that I will not use now that I’ve stopped using Proxmox
    • Hosting your data pool on the Proxmox host or a dedicated data VM means that your server VM needs to use NFS to access its data, which lacks a handful of features (e.g. inotify) and is a pain
    • Need to maintain two systems for updates, downtimes, etc
    • More points of failure
    • Extra startup time
    • Run by a company that thinks it’s okay to use winrar-style nag popups every time you load the console, and requires you to manually dig through the source to disable that. I understand it’s their business model, it doesn’t change how it affects me the end user who lacks $550/year to spend on disabling a popup