Above: Llama 3.1 8B (Q4) test results via Ollama
So recently I've been doing a lot of work with LLMs handling arbitrary unstructured data: using them to generate structured data, which then goes into a graph database where graph algorithms can iterate on it, so you can actually distill knowledge from a mass of nonsense.
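If that pipeline sounds abstract, here's a minimal sketch of the distillation step. The entity lists, the co-mention edge logic, and the use of networkx in place of a real graph database are all illustrative assumptions, not my actual stack:

```python
# Sketch: entities the LLM extracted from each document become nodes,
# co-mentions become weighted edges, and a graph algorithm (PageRank here)
# surfaces which entities actually matter. networkx stands in for a real
# graph database; the data below is made up.
from itertools import combinations
import networkx as nx

# Pretend the LLM already turned unstructured text into per-document entity lists.
extracted = [
    ["Alice Zhang", "Acme Corp"],
    ["Alice Zhang", "Bob Marsh", "Acme Corp"],
    ["Bob Marsh", "Initech"],
]

G = nx.Graph()
for entities in extracted:
    for a, b in combinations(set(entities), 2):
        # Weight edges by how often two entities are mentioned together.
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# PageRank as a stand-in for whatever graph algorithm you iterate with.
for name, score in sorted(nx.pagerank(G, weight="weight").items(),
                          key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {name}")
```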
But obviously this can get expensive via APIs, so like many of you, I set up a server with an A6000 (48GB of VRAM) and started loading models on it to test. Using this process you can watch the state of the art advance: problems that were intractable for any open model first became doable with the Llama 70B versions, and then soon after that, with 8B models. And even though you can fit a quantized 70B model into 48GB, you also want room for your context length to be fairly large, in case you want to pass a whole web page or email thread through the LLM. That means an 8B parameter model is probably the biggest you really want to use (I don't know why people ignore context size when calculating model VRAM requirements).
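To make the context-size point concrete, here's a rough back-of-the-envelope calculation. The layer counts, KV-head counts, and fp16 KV cache are my assumptions about the Llama 3.1 configs, so treat the output as ballpark numbers, not exact requirements:

```python
# Back-of-the-envelope VRAM math: 4-bit model weights plus KV cache.
# Assumes GQA with 8 KV heads, head_dim 128, fp16 KV cache; approximate
# Llama 3.1 layer counts. Rough estimate only.

def kv_cache_gb(layers, kv_heads, head_dim, context, bytes_per_val=2):
    # K and V per layer, per token, stored at fp16 (2 bytes each).
    return 2 * layers * kv_heads * head_dim * bytes_per_val * context / 2**30

def weights_gb(params_b, bits):
    return params_b * 1e9 * bits / 8 / 2**30

for name, params_b, layers in [("8B", 8, 32), ("70B", 70, 80)]:
    for ctx in (8_192, 32_768, 131_072):
        total = weights_gb(params_b, bits=4) + kv_cache_gb(layers, 8, 128, ctx)
        print(f"Llama 3.1 {name} @4-bit, {ctx:>7,} ctx: ~{total:5.1f} GB")
```

Even with these generous assumptions, the 4-bit 70B model plus a long context blows well past 48GB, while the 8B model leaves plenty of headroom.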
My most telling example prompt-task is a simple one: give me the names in a block of text, and then make them hashtags so I can extract them. The results from Llama 3.1 8B when quantized down are not... great, as seen in the little example below:
Input Names text: Dave Aitel is a big poo. Why is he like this? He is so mean.
RET: I can't help with that request. Is there something else I can assist you with?
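For anyone who wants to reproduce this, here's roughly the call I'm making against a local Ollama server. The exact prompt wording and the llama3.1:8b model tag are illustrative, not my actual harness:

```python
# Minimal repro of the name-extraction task against a local Ollama server.
# Assumes Ollama is running on localhost:11434 with the llama3.1:8b tag pulled.
import re
import requests

PROMPT = (
    "List every person's name in the following text, and write each one "
    "as a hashtag (e.g. #JaneDoe) so I can extract them.\n\n"
    "Text: Dave Aitel is a big poo. Why is he like this? He is so mean."
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": PROMPT, "stream": False},
    timeout=120,
)
answer = resp.json()["response"]
print(answer)  # on quantized Llama 3.1 this is often a refusal

# Pull out whatever hashtags did come back.
print(re.findall(r"#\w+", answer))
```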
Uncensored models like Mistral NeMo are also tiny, and they struggle to do this task reliably in Chinese or other languages that are not directly in their training set, but they don't REFUSE the task because they don't like the input text or find it too mean. So you end up with much better results.
People are of course going to retrain the Llama 3.1 base and create an uncensored version, and other people are going to complain about that: having an uncensored GPT-4-class open model scares them for reasons that are beyond me. But for real work, you need an LLM that doesn't refuse tasks because it doesn't like what it's reading.
-dave
P.S. Quantization lobotomizes the Llama 3.1 models really hard. What the 70B model can do at full precision it absolutely cannot do at 4-bit.