AI results based on Large Language Models
are probabilistic, not deterministic.

If you want to use AI to explore your qualitative data, it’s very important to understand exactly what this means.

Introduction

Large Language Models are not intelligent in any recognizable way. LLM output is the result of a statistical process in which the language model does nothing more than predict which word might reasonably come next, based on a complex mathematical analysis of which words appear near each other in its training data. Because this training data is huge, LLMs do a remarkably good job of mimicking human speech, and can appear surprisingly coherent, even insightful at times. But they don’t understand anything about anything.

The engineers who create LLMs have instructed them to introduce a degree of randomness in the process of next word selection. It turns out that LLM output appears much more like natural language if the model selects its next word from a list of likely next words instead of always selecting the single highest-probability next word. In other words, next word selection is based on probability rather than mathematical determinism.
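The difference between the two selection strategies can be sketched in a few lines of Python. This is only a toy illustration: the candidate words and their probabilities are invented, and a real LLM scores its entire vocabulary at every step.

```python
import random

# Hypothetical next-word probabilities for the context "The cat sat on the".
# These numbers are invented for illustration only.
NEXT_WORD_PROBS = {"mat": 0.55, "sofa": 0.25, "floor": 0.15, "moon": 0.05}


def pick_greedy(probs):
    """Deterministic: always return the single most likely word."""
    return max(probs, key=probs.get)


def pick_sampled(probs, rng):
    """Probabilistic: draw one word in proportion to its probability."""
    words = list(probs)
    weights = [probs[word] for word in words]
    return rng.choices(words, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random()  # unseeded, so the sampled words differ between runs
    print("greedy: ", [pick_greedy(NEXT_WORD_PROBS) for _ in range(5)])
    print("sampled:", [pick_sampled(NEXT_WORD_PROBS, rng) for _ in range(5)])
```

The greedy picker returns the same word every time, while the sampled picker usually returns the most likely word but sometimes returns a less likely one. That occasional variation is what makes the same prompt produce different output on different runs.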

This clever bit of code makes LLM output appear more human. But it is important that we don’t mistake LLM output for actual intelligence. It also has important implications for the use of AI in exploring qualitative data.

An Illustrative Prompt

I illustrate this point with a set of relatively simple AI prompts. These prompts are not useful for exploring qualitative data, but seeing how AI models handle these requests is enlightening and has important implications for using AI to explore qualitative data. 

The Prompt:

The table that follows contains a list of Ollama models and their file size on disk. Create a new table that lists a model count number you assign, the model name, and the model size. Sort this list from largest size to smallest size. At the end, add a summary that indicates the number of models in the list and the total disk space these models use.

I add different-sized lists of data items to see how much data the different models can handle correctly and where the process starts to break down.

One reason this prompt is useful is that there is a definitive correct response. We know the number of items in the list, and can easily determine if the output is sorted correctly. The task doesn’t require much intelligence, but accuracy and a certain degree of attention to detail are important. With qualitative data, it can be much more difficult to determine if an LLM response is factually accurate or not, so understanding the reliability of different AI models is an important place to start. This prompt gives us insight into the models’ reliability.
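Because the correct answer is known in advance, checking a response can be automated. The following is a minimal sketch of such a check, not the evaluation program actually used for these tests. It assumes the model's reply has already been parsed into (name, size) pairs in the order listed; the function name and the sample model names are hypothetical.

```python
def check_response(rows, expected_count):
    """Check a parsed reply for the two properties the prompt asks for.

    rows: list of (model_name, size) tuples, in the order the model listed them.
    expected_count: the number of items in the list that was submitted.
    """
    sizes = [size for _, size in rows]
    # Count is correct only if every submitted item produced exactly one row.
    count_correct = len(rows) == expected_count
    # Sort is correct if no row is larger than the row before it.
    sort_correct = all(a >= b for a, b in zip(sizes, sizes[1:]))
    return {"count_correct": count_correct, "sort_correct": sort_correct}


if __name__ == "__main__":
    # Hypothetical reply that dropped one of three items but sorted the rest.
    reply = [("model-a", 9.6), ("model-b", 5.2)]
    print(check_response(reply, expected_count=3))
```

Note that the sort check looks only at the rows the model actually returned, so a reply can fail the count check and still pass the sort check.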

The Models:

To investigate this set of prompts, I chose the gemma4 models, recently released by Google. Gemma4 is available in e2b, e4b, 12b, 26b, and 31b sizes; these numbers indicate the number of “parameters” (in billions) embedded in each model. These models represent the state of the art for embeddable LLMs that can be run on standard consumer computers using the Ollama architecture. The gemma4 series of models performed much better than most other models I tested with this prompt, and looking only at these models helps keep this test manageable.

I used a device context size of 64K for all prompt submissions. This context window has proven more than sufficient for even the longest lists submitted.
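For readers who want to try something similar, here is a rough sketch of how a prompt can be submitted to a local Ollama server with a fixed context size. It is not the program used for these tests. It assumes Ollama is listening on its default local port and that “64K” means 65,536 tokens; the function names are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint (assumes a default installation).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt, num_ctx=65536):
    """Build a request body; num_ctx=65536 is an assumed reading of '64K'."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one complete reply instead of a stream
        "options": {"num_ctx": num_ctx},  # context window size in tokens
    }


def submit(payload, url=OLLAMA_URL):
    """Send the request and return the generated text (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    payload = build_payload("gemma4:12b", "The table that follows contains ...")
    print(json.dumps(payload, indent=2))
```

Holding num_ctx constant across every submission keeps the context window from becoming an extra variable in the comparison.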

The Computers:

I submitted this set of prompts to multiple computers:

Name | OS | Processor / GPU | RAM
Win1 | Windows | Intel(R) Core(TM) i7-10700 / NVIDIA GeForce GTX 1660 Ti | 64 GB RAM / 6 GB vRAM
Win2 | Windows | Intel(R) Core(TM) Ultra 7 155H / NVIDIA GeForce RTX 4070 Laptop GPU | 32 GB RAM / 8 GB vRAM

I attempted to submit sets of prompts to two Macs, but since my Macs have only 8 GB of RAM, neither was able to run most of the prompts in a reasonable amount of time. For example, processing the 10 item list prompt with the gemma4:12b model took an average of about 6 minutes on my two Windows computers, while the same prompt with the same model took an average of over 7 hours on the two 8 GB Macs. The 20 item lists have been processing for over 16 hours so far, which makes these Macs useless for everyday AI processing.

If you want to use embedded AI on a Mac, make sure you have at least 16GB of RAM, and preferably more.

The Lists

I ran the prompt repeatedly with different amounts of data. I used lists of 10, 20, 30, 40, 50, 60, 70, 100, and 149 items. (Why 149 instead of 150? Because that’s the number of models I have installed on my Ollama server, setting the maximum of the list size I can create programmatically.)

Results

The most important factors for success in this set of AI tests are the AI model and the size of the list submitted. In all, I performed 90 tests (5 models x 9 list sizes x 2 computers), representing almost 36 hours of actual compute time.
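The test count follows directly from the design. As a quick sketch, the full set of runs can be enumerated from the models, list sizes, and computers described above:

```python
from itertools import product

MODELS = ["gemma4:e2b", "gemma4:e4b", "gemma4:12b", "gemma4:26b", "gemma4:31b"]
LIST_SIZES = [10, 20, 30, 40, 50, 60, 70, 100, 149]
COMPUTERS = ["Win1", "Win2"]

# Every combination of model, list size, and computer is one test run.
test_matrix = list(product(MODELS, LIST_SIZES, COMPUTERS))

if __name__ == "__main__":
    print(len(test_matrix))  # 5 models x 9 list sizes x 2 computers
```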

This table summarizes the results of my test runs.

Model | Number of Items | Count | Sort
gemma4:e2b | 10 | Correct | Correct
gemma4:e2b | 20 | 1 of 2 | Incorrect
gemma4:e2b | 30 – 149 | Incorrect | Incorrect
gemma4:e4b | 10 | Correct | Correct
gemma4:e4b | 20 | 1 of 2 | Correct as listed
gemma4:e4b | 30 | Correct | 1 of 2
gemma4:e4b | 40 – 149 | Incorrect | Incorrect
gemma4:12b | 10 – 40 | Correct | Correct
gemma4:12b | 50 | 1 of 2 | Correct as listed
gemma4:12b | 60 | Correct | Correct
gemma4:12b | 70 – 100 | Incorrect | Correct as listed
gemma4:12b | 149 | Incorrect | Incorrect
gemma4:26b | 10 – 60 | Correct | Correct
gemma4:26b | 70 | 1 of 2 | Correct as listed
gemma4:26b | 100 | Incorrect | Correct as listed
gemma4:26b | 149 | Incorrect | Incorrect
gemma4:31b | 10 – 70 | Correct | Correct
gemma4:31b | 100 | 1 of 2 | Correct as listed
gemma4:31b | 149 | Incorrect | Correct as listed

(Update:

Since I assembled this table, I made my evaluation program more sensitive. I found two new categories of errors that are not yet reflected in these results. First, some models listed the correct number of items but contained duplicate and missing items. Second, some models listed incorrect sizes for some models or for the total disk space used.)
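The two newer error categories can be detected by comparing the reply against the submitted list item by item. The sketch below shows one way to do this; it is an illustration under assumptions, not my actual evaluation program. It assumes both lists have been parsed into (name, size) pairs with sizes in whole megabytes, and the function and sample names are hypothetical.

```python
from collections import Counter


def compare_items(source, reply, reported_total=None):
    """Compare a parsed reply against the submitted list, item by item.

    source, reply: lists of (model_name, size_in_mb) tuples.
    reported_total: total disk space claimed in the reply's summary, if any.
    """
    source_sizes = dict(source)
    reply_counts = Counter(name for name, _ in reply)
    result = {
        # Names listed more than once in the reply.
        "duplicates": sorted(n for n, c in reply_counts.items() if c > 1),
        # Names that were submitted but never appear in the reply.
        "missing": sorted(n for n in source_sizes if n not in reply_counts),
        # Names whose size in the reply differs from the submitted size.
        "wrong_sizes": sorted(
            {n for n, s in reply if n in source_sizes and s != source_sizes[n]}
        ),
    }
    if reported_total is not None:
        result["total_correct"] = reported_total == sum(source_sizes.values())
    return result


if __name__ == "__main__":
    source = [("model-a", 900), ("model-b", 500), ("model-c", 200)]
    # Three rows, so the count looks right, but one item is repeated
    # and another is missing.
    reply = [("model-a", 900), ("model-b", 500), ("model-b", 500)]
    print(compare_items(source, reply, reported_total=1900))
```

In this example the reply passes a simple row count, yet the comparison reports a duplicate, a missing item, and an incorrect total.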

Gemma4:e2b

The smallest of the Gemma4 models was only able to reliably count and sort 10 list items. When presented 20 list items, one computer missed one entry and neither computer sorted the list correctly. Neither computer counted or sorted correctly when given 30 list items or more.

Average processing time for the Gemma4:e2b model was 2:16, with a range of 0:19 to 4:32.

Gemma4:e4b

The Gemma4:e4b model successfully sorted 10 item lists on both computers. When presented 20 items, one computer was successful, while the other computer skipped one item but sorted the 19 items listed correctly. When presented 30 items, both computers correctly listed 30 items but only one computer sorted the list correctly. Neither computer was able to count or sort 40 items or more.

Average processing time for the Gemma4:e4b model was 4:17, with a range of 1:08 to 11:08.

Gemma4:12b

The Gemma4:12b model was able to correctly count and sort up to 60 items, although one computer missed one item in the 50 item test. (It sorted the items it presented correctly.) It missed between 1 and 5 items in the 70 and 100 item lists, although it sorted all items it included correctly. In the 149 item list, one computer presented 148 items while the other presented 272 items, and neither computer sorted these lists correctly.

Average processing time for the Gemma4:12b model was 1:29:20, with a range of 4:29 to 6:04:10. It is likely that other processes (such as running automated off-site backups) interfered with the speed of a couple of the tests on one computer.

Gemma4:26b

Both computers were able to count and sort lists up to 60 items using Gemma4:26b. One of two computers skipped 3 items in the 70 item list, but both sorted what they presented correctly. Both computers dropped a small number of items from the 100 item list but successfully sorted what they presented. Neither computer could count or sort the 149 item list.

Average processing time for the Gemma4:26b model was 19:17, with a range of 3:00 to 49:09.

Gemma4:31b

Gemma4:31b is the largest and most powerful of the Gemma4 models. It was able to correctly process lists up to 70 items on both computers, while one computer skipped one item on the 100 item list. Both computers dropped 3 items on the 149 item lists but were able to correctly sort the items they presented.

Average processing time for the Gemma4:31b model was 1:23:52, with a range of 12:51 to 3:13:06.

Conclusions

Why is this set of results important? How is it relevant to qualitative analysis? Clearly, the results expected from this set of AI prompts have a “right” answer. They should be deterministic. In contrast, typical results from qualitative data exploration can afford to be more flexible, more probabilistic in nature. Aren’t we comparing apples and oranges here?

For the results of a prompt about qualitative data to be interesting, for those results to be potentially meaningful, they must first and foremost be based on accurate information. The prompts in this experiment show us the limits of what generative AI as implemented through LLMs is capable of. These prompts show us where LLMs lose the ability to represent information accurately. If the LLM cannot represent the data accurately internally, it certainly cannot produce meaningful insights in its output.

The current analysis shows us that the amount of data submitted to an LLM matters. When a list contains only 10 items, it is always counted and sorted correctly. At 100 items, only 1 of 10 runs produced the correct output. At 149 items, no list was processed perfectly. A list of 149 items with two values per line is, to put it plainly, not much data, especially when you compare it to, say, a full transcript of an hour-long media file.

The current analysis also teaches us something important about the number of parameters built into an LLM. The larger the number of parameters, the more list items could be handled correctly. We see a big jump in the number of items that can be processed correctly between the e4b model (technically the “effective 4 billion parameters” version of the model) and the 12b model (the 12 billion parameters version of the model). (Note that the context size of all prompt submissions was held constant at 64K in this experiment. Context size, while important, is a separate issue and does not influence the current results.)

The implications for exploring qualitative data with Transana are clear. First, it is important to be conscious of the amount of data being explored at one time. This is why Transana limits exploration to a single document or transcript at one time, and users should limit the scope of explorations of Collections in a mindful way. Researchers need to be conscious of not trying to do “too much” with a single AI prompt.

Second, as a general rule, it is probably safer to use models with more parameters when reasonable and possible. The more data you explore at once, the more important it is to use a sufficiently large and powerful model.  Of course, what models you can use depends on your hardware.

Finally, it is important to view the results of AI data exploration as hints rather than results. AI can be prone to errors and hallucinations, especially when stretched by the type and volume of data explored in qualitative analysis. In the current analysis, AI was almost always confident in the responses it provided, even when it was demonstrably wrong. “Here is the ranked list of Ollama models sorted by size from largest to smallest,” the results would assert. As Groucho would say, “Who are you going to believe, me, or your lying eyes?” Researchers should accept no insights or conclusions produced by AI exploration until they have delved into their data to confirm or refute the AI assertions.