• brucethemoose@lemmy.world
    17 hours ago

    Completely depends on your laptop hardware, but generally:

    • TabbyAPI (exllamav2/exllamav3)
    • ik_llama.cpp, and its OpenAI-compatible server
    • kobold.cpp (or kobold.cpp ROCm, or croco.cpp, depending on your hardware)
    • An MLX host (Apple silicon) with one of the new distillation-based quantizations
    • text-generation-webui (slow, but supports a lot of samplers and some exotic quantizations)
    • SGLang (extremely fast for parallel calls, if that's what you want).
    • Aphrodite Engine (lots of samplers, and fast at the expense of some VRAM usage).
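
    Whichever of these you run, most can expose an OpenAI-compatible endpoint, so one client script works across all of them. A minimal sketch in Python (the port, API key, and model name are placeholders, not defaults guaranteed by any particular backend; substitute whatever your server actually reports):

    ```python
    # Minimal sketch, assuming a local OpenAI-compatible server.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:5000/v1",  # placeholder: swap in your server's host/port
        api_key="not-needed",                 # local servers typically ignore the key
    )

    resp = client.chat.completions.create(
        model="your-loaded-model",            # hypothetical name; use the model you loaded
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)
    ```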

    I use text-generation-webui at the moment only because TabbyAPI is a little broken with exllamav3 (which is utterly awesome for Qwen3); otherwise I'd almost always stick to TabbyAPI.

    Tell me (vaguely) what your system has, and I can be more specific.