FieryLion summed it up pretty well. I've mostly played with models under 20b parameters, and they repeat themselves *frequently*. They'll basically blueprint or template part of a message and just keep pasting it back in like it's boilerplate. I had pretty good success with a 30b parameter Gemma fork running locally (I'm on a 2070 Super with 8GB VRAM), although I only get 3-5 tokens per second (super, super slow). But the dialog quality was absolutely stunning compared to the 8b and 14b models.

Qwen 2.5 is listed in several places as a "potato-friendly" model, but it's very unsophisticated: it can't handle lewd/suggestive content in a way that's actually titillating or that sustains suspension of disbelief, it's very repetitive and prone to "GPT-isms", and even if you crank up the temperature and loosen the sampling settings (to add a lot of randomness and creativity), it's still heavily constrained by how small it is.
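For reference, here's roughly what I mean by cranking the sampling settings. This is a minimal llama-cpp-python sketch; the model path and the exact values are just placeholders I picked for illustration, not a recommended config:

```python
# Minimal sampling sketch with llama-cpp-python.
# Model path and parameter values are placeholders, not recommendations.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/qwen2.5-7b-instruct-q4_k_m.gguf",  # hypothetical path
    n_ctx=4096,
)

out = llm.create_completion(
    "Continue the scene:",
    max_tokens=256,
    temperature=1.1,     # higher = more random/creative
    top_p=0.95,          # nucleus sampling: wider candidate pool
    repeat_penalty=1.2,  # pushes back on the copy-paste loops
)
print(out["choices"][0]["text"])
```

Even with settings like these, a small model only has so much to draw on, which is why the repetition never fully goes away.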
I _personally_ don't like running AI through a hosted endpoint, even though it would get me access to much better hardware and much fancier models. My two main reasons are privacy and cost. But if you want better-quality responses, you're going to have to pay for either much better hardware or a hosted solution (such as OpenRouter).
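If you do go the hosted route, OpenRouter exposes an OpenAI-compatible API, so switching over is basically just a base-URL change. A quick sketch (the model id here is only an example, and you'd supply your own key):

```python
# OpenRouter speaks the OpenAI API; only the base URL and API key differ.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key goes here
)

resp = client.chat.completions.create(
    model="google/gemma-2-27b-it",  # example model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

Convenient, but every prompt goes over the wire to a third party, which is exactly the privacy trade-off I was talking about.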