[image: image.png]
Above: Llama 3.1 8B (Q4) test results via Ollama
So recently I've been doing a lot of work with LLMs handling arbitrary
unstructured data, using them to generate structured data that then goes
into a graph database for graph algorithms to iterate on, so you can
actually distill knowledge from a mass of nonsense.
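To give you an idea of the shape of that pipeline, here's a minimal sketch - Ollama's /api/generate endpoint and Neo4j are just stand-ins here, and the prompt, model tag, and schema are placeholders rather than the exact thing I run:

```python
import json
import requests
from neo4j import GraphDatabase  # pip install neo4j; any graph store works here

# Placeholder extraction prompt: ask for people as JSON so the result is machine-readable.
EXTRACTION_PROMPT = (
    "Extract every person mentioned in the text below. Return JSON shaped like "
    '{"people": [{"name": "...", "mentioned_with": ["..."]}]}.\n\nText: '
)

def extract_people(text: str) -> dict:
    # Ollama's generate endpoint; format="json" nudges the model toward valid JSON.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "mistral-nemo",  # placeholder model tag
            "prompt": EXTRACTION_PROMPT + text,
            "format": "json",
            "stream": False,
        },
        timeout=120,
    )
    return json.loads(resp.json()["response"])

def load_into_graph(driver, extracted: dict) -> None:
    # MERGE keeps nodes and edges deduplicated so graph algorithms can iterate later.
    with driver.session() as session:
        for person in extracted.get("people", []):
            session.run("MERGE (:Person {name: $name})", name=person["name"])
            for other in person.get("mentioned_with", []):
                session.run(
                    "MERGE (a:Person {name: $a}) "
                    "MERGE (b:Person {name: $b}) "
                    "MERGE (a)-[:MENTIONED_WITH]->(b)",
                    a=person["name"], b=other,
                )

if __name__ == "__main__":
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    load_into_graph(driver, extract_people("Some email thread or web page text..."))
    driver.close()
```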
But obviously this can get expensive via APIs, so like many of you, I set
up a server with an A6000 (48 GB of VRAM) and started loading models onto it
to test. Using this process you can watch the state of the art advance:
problems that were intractable for any open model first became doable with
the Llama 70B versions, and then soon after that, with 8B models. Even
though you can fit a quantized 70B model into 48 GB, you also want room for
a fairly large context, in case you want to pass a whole web page or email
thread through the LLM, which means an 8B-parameter model is probably the
biggest you really want to use. I don't know why people ignore context size
when calculating model VRAM requirements.
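To put numbers on that: the KV cache alone for a long context eats a real slice of your VRAM. A quick back-of-envelope sketch, using what I believe are the published Llama 3.1 layer counts and GQA head config, with an fp16 cache and ignoring activations and framework overhead:

```python
# Rough KV-cache sizing: 2 (K and V) * layers * kv_heads * head_dim * bytes/value
# per token. Config numbers are the published Llama 3.1 shapes as I understand
# them; treat the output as a floor, not an exact figure.
def kv_cache_gib(context_tokens: int, layers: int, kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
    return context_tokens * per_token / 1024**3

for name, layers in (("8B", 32), ("70B", 80)):
    for ctx in (8_192, 32_768, 131_072):
        print(f"Llama 3.1 {name}: {ctx:>7} tokens -> "
              f"~{kv_cache_gib(ctx, layers):.1f} GiB KV cache")
```

The full 128K context works out to roughly 16 GiB of cache for the 8B and roughly 40 GiB for the 70B, and a 4-bit 70B is already around 40 GB of weights, so on a 48 GB card the big model plus a real context simply doesn't fit.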
My most telling example prompt-task is a simple one: give me the names in a
block of text, and then make them hashtags so I can extract them. The
results from Llama 3.1 8B when quantized down are not... great, as seen in
the little example below:
*Input text: Dave Aitel is a big poo. Why is he like this? He is so mean.*
*Response: I can't help with that request. Is there something else I can
assist you with?*
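For reference, there's nothing fancy about the harness; hitting Ollama's HTTP API with something like this (model tag and prompt wording are approximate) reproduces the refusal above:

```python
import requests

# Same task as above: list the names, then emit them as hashtags for extraction.
prompt = (
    "List every person's name in the following text, then repeat each name as a "
    "hashtag (e.g. #JaneDoe) so it can be extracted downstream.\n\n"
    "Text: Dave Aitel is a big poo. Why is he like this? He is so mean."
)

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
    timeout=120,
)
print(resp.json()["response"])  # the quantized 8B answers this with the refusal above
```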
Uncensored models like Mistral NeMo are also tiny, and they struggle to do
this task reliably on Chinese or other languages that aren't directly in
their training set, but they don't REFUSE to do the task because they don't
like the input text or find it too mean. So you end up with much better
results.
People are of course going to retrain the Llama 3.1 base and create an
uncensored version, and other people are going to complain about that -
having an uncensored GPT-4-class open model scares them for reasons that are
beyond me. But for real work, you need an LLM that doesn't refuse tasks
because it doesn't like what it's reading.
-dave
P.S. Quantization lobotomizes the Llama 3.1 models really hard. What they
can do at full precision on the 70B model they absolutely cannot do at
4-bit.