Using AI in Transana
tl;dr
This page starts with a comparison of External vs. Embedded AI options. It describes essential information about embedded models.
Next, the page lays out how I evaluated the numerous AI models available in Transana. I go into a lot of detail in this section so you can understand exactly how I reached my conclusions. Lessons Learned describes those conclusions.
The section called Asking AI to Evaluate AI provides rankings of the AI models available in Transana for exploring Text and for exploring Images. If you are here looking for guidance about what AI models to use, read this section.
But if you read nothing else, please read and re-read the last section, called The Last Word. Understanding that section is vital to all researchers who want to use AI as part of qualitative data analysis.
Introduction
To use AI well in Transana, researchers need to make several important decisions. This article highlights two major choices.
The choice between using external AI tools and embedded (internal) AI tools is fairly straightforward, and this article lays out the advantages and disadvantages of each option.
The choice of which AI model to use when exploring research data is more challenging. While external AI tools generally offer limited choices and limited flexibility, embedded AI offers an overwhelming set of choices. The bulk of this article is dedicated to laying out these choices, describing the process I used to narrow down the options, and sharing the results of my extensive testing of AI models.
External AI and Embedded AI
The first choice a researcher must make when using AI in Transana is between external AI and embedded (or internal) AI.
External AI uses an outside service such as ChatGPT, Gemini, Copilot, or Claude for AI queries. Currently, Transana supports ChatGPT for external exploration of data.
Embedded AI uses an AI service on a computer controlled by the researcher. Currently, Transana supports the Ollama system, which offers hundreds of individual AI models to choose from.
Each of these systems has advantages and disadvantages:
| | External AI | Embedded AI |
|---|---|---|
| **Advantages** | **Quality.** ChatGPT 5 often offered excellent quality responses during my AI prompt tests (see below). It was less prone to making false statements than some of the other models I tested. | **Private, Secure.** When properly configured, data is securely processed on a computer controlled by the researcher. No data is sent to external servers. |
| **Disadvantages** | **Requires an account.** External AI providers require user accounts. They want a credit card, and they track usage. | **Requires setup.** The researcher must set up their own AI server. This is fairly easy, but requires installing an additional program (see the Ollama Setup Instructions). Researchers must also download the model(s) they want to use. |
| **Implementation** | Transana uses OpenAI’s ChatGPT tool to provide external AI exploration. | Transana uses the Ollama system to provide embedded AI exploration. |
Privacy and Confidentiality
Data privacy, security, and confidentiality are central issues for most research projects. When researchers use an external AI tool, they send their data to a server that is not under their control. A wide variety of company policies and legal issues influence what happens to the data once it is received by the external computer. It is imperative that researchers understand the privacy and confidentiality policies of the companies they work with for external AI exploration of their data. Researchers should never submit data for external AI exploration without the explicit approval of their Institutional Review Board or other ethics board overseeing their research.
With embedded AI processing in Transana, all AI processing is handled by the Ollama server selected by the researcher. This Ollama server may reside on their own computer, or they may configure Transana to use an Ollama server on a different computer under their control. When properly configured, the Ollama server does not share or retain any data. Researchers should only connect to Ollama servers on their own computer or on computers controlled by someone they trust, such as their department, university, or organizational IT department. Choosing a server they control is how researchers ensure the privacy and confidentiality of their data during the AI exploration phase of their analysis.
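To make the privacy boundary concrete, here is a minimal sketch of how a client talks to an Ollama server through its REST API. Transana handles this for you, and both host names below are placeholders; the point is simply that the researcher chooses exactly which machine receives the data.

```python
import json

# Ollama listens on port 11434 and exposes a simple REST API.
# The only difference between "my own computer" and "a trusted server in
# my department" is the host portion of the URL; the data travels nowhere
# else. Both host names here are placeholders, not real servers.
LOCAL_HOST = "http://localhost:11434"
TRUSTED_HOST = "http://lab-server.example.edu:11434"  # hypothetical trusted machine

def build_generate_request(host: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for Ollama's /api/generate endpoint."""
    url = f"{host}/api/generate"
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return url, body

url, body = build_generate_request(LOCAL_HOST, "gemma3:12b",
                                   "Summarize the following transcript")
# Everything in `body` travels only to `url` -- no third-party AI service sees it.
```

Pointing Transana at a different Ollama server is just a matter of changing that host value to a machine you, or someone you trust, controls.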
Selecting Models
For external AI, the main choice of model is made when selecting the tool; only ChatGPT is currently supported in Transana, and ChatGPT only offers a small number of model options. If other external options, such as Claude, Copilot, or Gemini, were available, evaluating and selecting an external model would take a bit more work. Thus, external model selection is artificially simple compared to embedded model selection.
For embedded AI, there are many models to choose from, most of them not widely known, making this choice less obvious. The remainder of this article is devoted to selecting AI models for best results.
Embedded AI Models and Model Parameters
AI Models are algorithms that determine how an AI works with data. They are trained on (typically very large) data sets and are designed to handle certain types of tasks and achieve certain types of goals. Transana supports the external ChatGPT AI service from OpenAI. It also supports an embedded AI tool called Ollama to manage the download, selection, and use of pre-defined AI models to allow the embedded exploration of qualitative data.
ChatGPT offers limited options for users, effectively hiding a lot of complexity from end users. Ollama presents a more complex AI landscape, requiring more background knowledge of the researcher.
Models
ChatGPT offers a small handful of models, mostly different versions of the same set of algorithms. Recent options (as of this writing) include ChatGPT 5, ChatGPT 5.1, and ChatGPT 5.2.
Ollama offers a large catalog of models. See the Models page on Ollama’s web site for more information.
Parameters
Ollama AI models are characterized by their parameters, the internal variables an AI model uses to map input data to outputs, which influence the model’s ability to see patterns in data. To over-simplify, the more parameters a model is built with, the more sophisticated a response it should be able to generate. However, the number of parameters also affects factors such as memory requirements and processing speed.
Some Ollama models support several different parameter options, and we have determined that the parameters value matters to how models perform within Transana. Models are always presented as “(model name):(parameters)” pairs; for example, “gemma3:12b” is the gemma3 model with the 12-billion-parameter setting. We strongly recommend that Transana users avoid “cloud” and “turbo” variants, as these require external processing within Ollama and may compromise data confidentiality.
You can explore the full list of models available with Ollama with their parameter options through the “Models” section of the Ollama website.
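As a rough illustration of why the parameter count matters for memory, here is a back-of-the-envelope estimate. This is a rule of thumb, not Ollama’s exact figures; it assumes the 4-bit quantization common for Ollama downloads, and real usage is higher once the context window and runtime overhead are added.

```python
def approx_weight_gb(params_billions: float, bits_per_weight: int = 4) -> float:
    """Rough memory needed for the model weights alone, in GB.

    Rule-of-thumb estimate: parameters times bits per weight, converted
    to bytes. Actual requirements are higher once the context window and
    runtime overhead are included.
    """
    return params_billions * bits_per_weight / 8

# "gemma3:12b" -> about 6 GB of weights at 4-bit quantization, which helps
# explain why larger models struggle on a computer with only 8 GB of RAM.
print(round(approx_weight_gb(12), 1))  # -> 6.0
print(round(approx_weight_gb(7), 1))   # -> 3.5
```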
Testing Ollama Models
Ollama offers a huge number of models. I wanted to test as many of these models as I could to determine which did a good job of exploring qualitative data within Transana. The next several sections of this article provide a narrative of how I approached this task and what I found.
For testing the analysis of text (including transcripts), I used a transcript of the movie “12 Angry Men” as my initial testing data because it was long enough (90 minutes) and complex enough (12 major speakers) to represent a potential challenge to AI, and because it is non-confidential data that others can obtain if they want to explore, replicate, or extend my test results.
For testing still image analysis by AI, I used a photograph I took a few years ago while traveling. The image included several distinct elements that could be included in the analysis.
While a movie transcript does not represent typical qualitative research data, the narrative of the movie provides information that can be analyzed qualitatively and that leads to clear conclusions, making between-model comparisons of AI results easier than if actual qualitative data had been used.
I performed three rounds of tests.
First Tests
For my first attempt at testing, I selected a very simple prompt on my main development computer, paired with the 12 Angry Men transcript.
Summarize the following transcript
For these tests, I used the default settings for Context Size (Ollama defaults to 4K, but individual models can override this) and Temperature (often 0.8, but this can vary by model). Context Size describes how much text, measured in tokens, the AI can hold and process at once, covering both the submitted data and the response. Temperature determines the amount of “randomness” the AI engine introduces when forming a response to the prompt and data submitted.
There were three mistakes in this setup that doomed these test attempts to failure.
- First, it appears that summarizing the data is a default behavior for many AI models. When I tried more complex prompts with the default settings used here, a surprising number of models returned a summary rather than responding to the specific prompt they were given. So a summary request is not a good test of AI models when working with text-based data.
- Second, the transcript I submitted contained more than 20,000 tokens, far too much data for Ollama’s default context size for many models. With non-summary prompts, some models could only “see” and respond to the submitted prompt, rather than producing a summary, when the context size setting was large enough. Even then, this was only true for some models.
- Third, since the computer I used for these first tests has 64 GB of RAM, it did not represent the computers that many Transana users have. When I tried this prompt and data on different computers with less RAM, some models that worked on the first computer could not run the same prompts with the same settings due, presumably, to RAM limitations.
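The context-size problem in the second point can be sanity-checked with a rough estimate before running a model. The four-characters-per-token figure below is a common rule of thumb for English text, not an exact tokenizer count.

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token in English."""
    return len(text) // 4

def fits_in_context(text: str, context_size: int,
                    reserve_for_response: int = 2048) -> bool:
    """Would this document, plus room for a response, fit the context window?"""
    return rough_token_count(text) + reserve_for_response <= context_size

transcript = "word " * 20_000               # stand-in for a ~25,000-token transcript
print(fits_in_context(transcript, 4096))    # default 4K context: False (too small)
print(fits_in_context(transcript, 32768))   # 32K context: True
```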
Second Tests
Having learned these three lessons, I discarded these results and started over.
For my second attempt at testing, I changed my prompt to explicitly indicate that the data I submitted was a transcript of a jury deliberation and to request a description of each of the jurors.
This is a transcript of a jury deliberation. Describe each juror in a separate paragraph, including juror number, name if known, occupation, and personality.
After a little experimentation, I settled on a context size of 32K, large enough to hold the 20K+ tokens of data I was exploring plus a reasonably sized response. For this test, I used a Temperature of 0.8. I ran each test on multiple computers. I was not able to test all models on all computers because these tests could take as long as 24 hours in some circumstances. I decided to skip some tests that were unlikely to succeed for the sake of efficiency and sanity.
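For readers curious how these settings translate to Ollama itself, here is a sketch of how the 32K context size and 0.8 Temperature would appear in a direct call to Ollama’s REST API. Transana sets these values through its own interface; the model name and abbreviated prompt are just examples.

```python
import json

# The settings described above map onto the "options" object of an
# Ollama /api/generate request: num_ctx is the context size in tokens,
# and temperature controls randomness in the response.
request_body = {
    "model": "gemma3:12b",   # example model; any installed model:parameters pair works
    "prompt": "This is a transcript of a jury deliberation. "
              "Describe each juror in a separate paragraph...",
    "stream": False,
    "options": {
        "num_ctx": 32768,    # 32K context size used for the text tests
        "temperature": 0.8,  # Temperature used for the text tests
    },
}
payload = json.dumps(request_body)
```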
A condensed summary of the results of this testing is presented in expandable tables below. The first table presents the model and parameter combinations that produced “good” results on at least one computer. This table does not include the large number of models for which all tests failed. The second table is a more detailed description of the five computers used for testing.
Detailed Results For 28 Models With At Least One Good Text Exploration Result (Click to expand)
| Model and Parameters | macOS M2 8GB | macOS M1 8GB | macOS M1 Pro 16GB | Win 11, 64 GB, 6 GB GPU | Win 11, 32 GB, 8 GB GPU | Success Rate |
|---|---|---|---|---|---|---|
| deepseek-r1:14b | TIME | GOOD (19:32:20.0) | ACCEPT (0:36:31.3) | ERROR | 25% | |
| devstral-small-2:24b | TIME | TIME | GOOD (0:33:30.9) | GOOD (0:22:50.1) | 50% | |
| devstral:24b | GOOD (20:36:02.7) | GOOD (17:18:05.0) | TIME | GOOD (0:38:03.0 ) | ERROR | 60% |
| falcon3:10b | ACCEPT (11:03:12.8) | GOOD (6:58:58.5 ) | ACCEPT (0:55:15.0) | GOOD (0:18:26.7 ) | GOOD (0:03:31.6 ) | 60% |
| falcon3:7b | ACCEPT (7:27:10.9 ) | ACCEPT (2:43:56.1 ) | ACCEPT (0:05:59.6 ) | GOOD (0:14:11.2 ) | GOOD (0:01:50.8 ) | 40% |
| gemma3:12b | GOOD (12:29:19.1) | GOOD (10:14:51.3) | GOOD (0:10:04.6 ) | GOOD (0:14:21.5 ) | GOOD (0:03:42.1 ) | 100% |
| gpt-oss:20b | TIME | ERROR | ERROR | ACCEPT (1:13:02.0 ) | GOOD (0:10:06.3 ) | 20% |
| granite3.1-dense:8b | ACCEPT (11:32:10.5) | GOOD (17:33:26.2) | TIME | ACCEPT (0:24:54.2 ) | REJECT (0:07:28.0 ) | 20% |
| granite3.2:8b | ACCEPT (15:15:20.1) | ACCEPT (17:25:51.6) | GOOD (0:27:47.1 ) | GOOD (0:07:38.9 ) | 50% | |
| granite3.3:8b | TIME | GOOD (12:47:44.4) | GOOD (9:44:58.4 ) | ACCEPT (0:22:08.4 ) | ACCEPT (0:08:01.2 ) | 40% |
| llama3.2-vision:11b | GOOD (11:20:37.8) | ACCEPT (9:45:02.1 ) | TIME | ACCEPT (0:19:30.1 ) | REJECT (0:04:52.4 ) | 20% |
| magistral:24b | TIME | TIME | GOOD (2:31:39.5 ) | FAIL (0:00:00.0 ) | 25% | |
| marco-o1:7b | ACCEPT (13:21:58.4) | GOOD (4:57:05.2 ) | ACCEPT (0:08:48.6 | GOOD (0:17:30.9 ) | GOOD (0:03:44.2 ) | 60% |
| ministral-3:14b | TIME | TIME | GOOD (00:34:51.7 ) | GOOD (00:24:27.0) | 50% | |
| ministral-3:3b | GOOD (01:22:50.8 ) | GOOD (01:26:11.0 ) | GOOD (00:14:10.8 ) | GOOD (00:01:38.9 ) | 100% | |
| ministral-3:8b | TIME | GOOD (18:30:19.4) | GOOD (0:23:06.4) | GOOD (0:10:57.3) | 75% | |
| mistral-small3.2:24b | GOOD (29:37:02.1) | TIME | GOOD (21:16:49.7) | GOOD (1:26:21.3 ) | GOOD (1:37:28.8 ) | 80% |
| mixtral:8x7b | ERROR | GOOD (12:29:16.1) | ERROR | GOOD (0:34:56.7 ) | GOOD (0:11:08.9 ) | 60% |
| olmo-3.1:32b | TIME | TIME | GOOD (2:08:58.4) | ACCEPT (2:04:26.4) | 25% | |
| olmo-3:32b | TIME | ERROR | TIME | GOOD (1:11:43.7) | 25% | |
| phi4-reasoning:14b | TIME | TIME | TIME | GOOD (2:11:53.1 ) | GOOD (1:46:32.0 ) | 40% |
| qwen2.5:7b | REJECT (7:28:27.4) | REJECT (3:41:36.9 ) | GOOD (0:12:31.3 ) | ERROR | 25% | |
| qwen3-vl:30b | ERROR | TIME | ERROR | GOOD (3:12:13.8 ) | GOOD (1:19:33.8 ) | 40% |
| qwen3-vl:8b | TIME | TIME | GOOD (0:24:23.2 ) | GOOD (0:50:29.6 ) | GOOD (0:54:18.6 ) | 60% |
| qwen3:14b | TIME | ERROR | GOOD (0:44:36.1 ) | ACCEPT (0:22:03.3 ) | 25% | |
| qwen3:4b | ACCEPT (17:34:11.3) | GOOD (13:37:49.0) | ACCEPT (0:08:14.9 ) | GOOD (0:16:47.4 ) | ACCEPT (0:02:25.4 ) | 40% |
| qwen3:8b | TIME | GOOD (0:14:22.4 ) | ACCEPT (0:07:28.3 ) | ACCEPT (0:22:08.6 ) | GOOD (0:16:10.0 ) | 40% |
| qwq:32b | TIME | TIME | GOOD (1:14:20.5 ) | ERROR | 25% | |
| Summary | macOS M2 8GB | macOS M1 8GB | macOS M1 Pro 16GB | Win 11, 64 GB, 6 GB GPU | Win 11, 32 GB, 8 GB GPU | Total |
|---|---|---|---|---|---|---|
| GOOD | 5 | 12 | 4 | 21 | 17 | 59 |
| ACCEPT | 6 | 3 | 5 | 6 | 4 | 24 |
| REJECT | 1 | 1 | 0 | 0 | 2 | 4 |
| FAIL | 0 | 0 | 0 | 0 | 1 | 1 |
| ERROR | 2 | 3 | 3 | 0 | 4 | 12 |
| TIME | 14 | 9 | 4 | 1 | 0 | 28 |
| Total | 28 | 28 | 16 | 28 | 28 | 128 |
| Average Time | 13:15:42.0 | 10:34:48.4 | 3:40:13.7 | 0:50:20.1 | 0:30:17.4 | |
| Total Time | 159:08:23.9 | 169:16:55.2 | 33:02:03.3 | 22:39:02.5 | 11:36:39.7 | |
Key:
- GOOD – AI model produced a relatively good result with adequate detail and few errors, none of which were major.
- ACCEPT – AI model produced a result with adequate information on most jurors and few serious errors. Some important points may have been left out.
- REJECT – AI model produced a result that lacked substance or contained major errors.
- FAIL – AI model did not produce a meaningful result. It sometimes produced a summary of the data, but that’s not what I asked for. Other times, I didn’t even get that.
- ERROR – AI model produced an error message rather than a result. This message usually indicated insufficient memory or other resources.
- TIME – AI model ran for a very long time without producing a result or an error message. For four of the five computers, I stopped tests after 24 hours.
Additional Details About Testing Computers (Click to expand)
Our testing has been conducted on five computers.
macOS testing was done on three computers.
- An M2-based Mac Mini with 8 GB of RAM. All Ollama models were stored on an external hard drive.
- An M1-based Macbook Pro with 8 GB of RAM.
- Additional testing was conducted at a different location using a computer with an M1 Pro processor and 16 GB of RAM.
Windows testing was conducted using two computers.
- An older desktop with 64 GB of RAM and a 6 GB Nvidia graphics card
- A newer laptop with 32 GB of RAM and an 8 GB Nvidia graphics card
Both computers run Windows 11.
(If you are interested in helping with additional testing, or would like to donate computers or money to our hardware fund, please let me know through the Contact Form.)
There are several important points to keep in mind when reviewing this.
- First, new models are coming out all the time. This is a snapshot of a moving target.
- Second, I used a very broad prompt. In actual analysis, revising the prompt is an important step in AI exploration, and it is likely that changing the prompt will affect the AI output in both expected and unexpected ways.
- Third, this is one example of a prompt and data. The data is from a movie, so may not represent real-world research data in important ways.
Third Test
For the third round of testing, I submitted a still image to the full suite of Ollama models with the following query:
Describe the following image:
While asking for a description was not a good test for text data, it revealed something very interesting for these image tests. Some models described an image in response to this prompt that was clearly not the image I submitted. This description of a different image could sometimes be quite detailed. Thus, for images, this description prompt ends up revealing models that are not able to process images the way Transana submits them but that do not inform the researcher of this failure.
I settled on a context size of 48K for the image tests. For these tests, I set the Temperature to 0.3, as I wanted more consistent results. I ran each test on multiple computers. I was not able to test all models on all computers because these tests could take as long as 24 hours in some circumstances, and I decided to skip some tests that were unlikely to succeed for the sake of efficiency and sanity.
The results of this testing are presented in an expandable table below. This table presents the 21 model and parameter combinations that produced “good” results on at least one computer. Note that there is limited overlap between these models and the models that did well with text. (See both above and below.)
Detailed Results For 21 Models With At Least One Good Image Exploration Result (Click to expand)
| Model | Mac1 | Mac2 | Win1 | Win2 | Success Rate | |
|---|---|---|---|---|---|---|
| bakllava:7b | GOOD (1:24:09.4) | GOOD (0:11:21.6) | ERROR | GOOD (0:00:18.9) | 75% | |
| devstral-small-2:24b | GOOD (3:19:01.2) | GOOD (2:59:33.2) | GOOD (0:09:37.8) | GOOD (0:02:17.0) | 100% | |
| gemma3:12b | GOOD (2:07:42.8) | GOOD (1:51:55.1) | GOOD (0:10:15.4) | GOOD (0:01:23.4) | 100% | |
| gemma3:4b | GOOD (0:00:15.6) | GOOD (0:00:39.5) | GOOD (0:01:54.2) | GOOD (0:00:24.0) | 100% | |
| granite3.2-vision:2b | GOOD (0:00:59.2) | GOOD (0:00:48.3) | GOOD (0:01:49.4) | GOOD (0:00:14.6) | 100% | |
| llama3.2-vision:11b | GOOD (0:44:21.7) | GOOD (1:18:09.5) | GOOD (0:11:51.6) | GOOD (0:01:52.8) | 100% | |
| llava-llama3:8b | GOOD (1:01:03.8) | GOOD (0:06:52.7) | ERROR | GOOD (0:00:29.2) | 75% | |
| llava-phi3:3.8b | GOOD (0:00:45.6) | GOOD (0:00:26.1) | ERROR | GOOD (0:00:15.5) | 75% | |
| llava:13b | GOOD (3:37:45.2) | GOOD (0:28:05.5) | ERROR | GOOD (0:01:02.4) | 75% | |
| llava:7b | GOOD (1:18:17.1) | GOOD (0:20:36.1) | FAIL (0:02:25.2) | GOOD (0:00:26.1) | 75% | |
| minicpm-v:8b | GOOD (0:49:18.2) | GOOD (0:27:14.9) | ERROR | GOOD (0:00:30.1) | 75% | |
| ministral-3:14b | GOOD (1:25:58.7) | GOOD (2:30:50.0) | ACCEPT (0:06:10.9) | GOOD (0:01:21.5) | 100% | |
| ministral-3:3b | GOOD (0:08:42.7) | GOOD (0:51:35.1) | GOOD (0:02:07.8) | GOOD (0:00:25.1) | 100% | |
| ministral-3:8b | GOOD (0:33:13.3) | GOOD (1:22:12.0) | GOOD (0:04:04.9) | GOOD (0:00:52.5) | 100% | |
| mistral-small3.2:24b | GOOD (2:42:45.3) | ERROR | GOOD (0:16:01.0) | GOOD (0:02:12.0) | 75% | |
| qwen2.5vl:3b | ERROR | ERROR | GOOD (0:09:42.6) | GOOD (0:07:40.5) | 50% | |
| qwen2.5vl:7b | ERROR | ERROR | GOOD (0:15:46.9) | GOOD (0:09:57.6) | 50% | |
| qwen3-vl:2b | GOOD (0:07:47.5) | GOOD (0:09:22.2) | GOOD (0:02:21.2) | GOOD (0:00:29.8) | 100% | |
| qwen3-vl:30b | GOOD (3:03:43.2) | GOOD (2:12:40.5) | GOOD (0:11:39.2) | GOOD (0:02:48.0) | 100% | |
| qwen3-vl:4b | GOOD (0:41:15.2) | GOOD (0:37:15.7) | GOOD (0:03:14.6) | GOOD (0:01:06.3) | 100% | |
| qwen3-vl:8b | GOOD (2:48:07.8) | GOOD (2:00:03.4) | GOOD (0:05:48.7) | GOOD (0:02:15.7) | 100% | |
| Summary | Mac1 | Mac2 | Win1 | Win2 | Total |
|---|---|---|---|---|---|
| GOOD | 19 | 18 | 14 | 21 | 72 |
| ACCEPT | 0 | 0 | 1 | 0 | 1 |
| FAIL | 0 | 0 | 1 | 0 | 1 |
| ERROR | 2 | 3 | 5 | 0 | 10 |
| Total | 21 | 21 | 21 | 21 | 84 |
| Average Time | 1:21:51.2 | 0:58:19.0 | 0:07:10.7 | 0:01:49.7 | |
| Total Time | 25:55:13.5 | 17:29:41.5 | 1:54:51.4 | 0:38:23.0 | |
Key:
- GOOD – AI model produced a relatively good result with adequate detail and few errors, none of which were major.
- ACCEPT – AI model produced a result with adequate information about the image and few serious errors. Some important points may have been left out or were incorrect.
- FAIL – AI model did not produce a meaningful or useful result.
- ERROR – AI model produced an error message rather than a result. This message usually indicated insufficient memory or other resources.
Models that failed
The models listed below failed to produce adequate results for either text or image exploration.
99 Models That Failed AI Exploration (Click to expand)
| all-minilm:33m aya-expanse:8b aya:35b aya:8b bespoke-minicheck:7b bge-large:335m bge-m3:567m cogito:14b cogito:3b cogito:8b command-a:111b command-r7b-arabic:7b command-r7b:7b command-r:35b deepscalar-r:1.5b deepseek-llm:7b deepseek-v3:671b deepseek-r1:7b deepseek-v2.5:236b deepseek-v2:16b deepseek-v2:236b dolphin-llama3:8b dolphin-mistral:7b dolphin3:8b embeddinggemma:300m |
everythinglm:13b exaone-deep:7.8b exaone3.5:7.8b falcon3:3b firefunction-v2:70b gemma2:9b gemma3:1b gemma3n:e2b gemma3n:e4b gemma:7b glm4:9b granite-embedding:278m granite3-guardian:8b granite3.1-dense:2b granite3.1-moe:3b granite3.2:2b granite3.3:2b granite4:1b granite4:3b hermes3:3b hermes3:8b internlm2:7b llama2-uncensored:7b llama2:13b llama3.1:8b |
llama3.2:3b llama3.3:70b llama3:8b llama4:16x17b mistral-large:123b mistral-nemo:12b mistral-small:22b mistral:7b mistrallite:7b nemotron-3-nano:30b nemotron-mini:4b nemotron:70b nuextract:3.8b olmo-3:7b olmo2:13b olmo2:7b openthinker:7b orca2:13b orca2:7b phi3.5:3.8b phi3:14b phi3:3.8b phi4-mini-reasoning:3.8b phi4-mini:3.8b phi4:14b |
qwen2.5:14b qwen2.5:3b qwen2.5:7b qwen2:7b qwen3-embedding:8b qwen:14b r1-1776:70b reflection:70b rnj-1:8b sailor2:8b smallthinker:3b smollm2:1.7b snowflake-arctic-embed:335m snowflake-arctic-embed2:568m solar-pro:22b stable-beluga:13b stablelm2:1.6b stablelm2:12b starling-lm:7b tulu3:70b tulu3:8b wizardlm2:7b yarn-llama2:13b zephyr:7b |
Lessons Learned
- I tested 128 combinations of models and model parameters for Ollama LLMs and 3 models for ChatGPT.
- Using 2 Windows computers, I tested all text analysis and image analysis for all 128 Ollama models for a total of 512 tests. (2 computers x 2 data types x 128 models = 512 tests.)
- I found a total of 38 “Good” results for text and 35 “Good” results for still images: 73 of 512 tests, a 14.3% overall success rate. Over 85% of these tests failed when held to a quality standard of “good”; only about 1 test in every 7 succeeded.
- Still focusing on these Windows tests, 40 models produced at least one “Good” response across text and image analyses.
- Over 2 out of every 3 models failed at both text and image analysis tasks.
- But that over-states the results. Over half (54.4%) of the 160 tests performed on just these 40 models on Windows failed.
- I could not test all 128 models on both of my macOS computers. Processing times were too long for this to be practical.
- I did, however, run all 40 models that produced at least one “Good” result in the Windows tests on macOS for both text analysis and image analysis.
- 60.3% of the 320 tests performed on these 40 models failed.
- Results on macOS were worse than results on Windows, especially for text analysis. (This could be due to memory limitations on my macOS computers. It could also be that some models don’t work on Apple’s hardware and OS.)
- This suggests a challenging environment for qualitative researchers. The task of picking a model or set of models is complicated. (It is why I’ve included so much detail in this article.)
- With that caveat, I can also say that some Ollama models produced high quality results, rivaling, and in the case of image analysis, surpassing the results from the current ChatGPT model tests with the same prompts and data.
- Results are sometimes different with the same prompt and the same data and the same settings on different computers. Reducing the Temperature setting will likely address some of this. (In retrospect, I wish I’d chosen 0.3 for the Temperature setting for the text tests, as I did for the image tests, to increase consistency across different variations of the same model.)
- Hardware matters. Because I had only a few computers to test with, I can’t sort out all the factors. Your computer will likely differ from mine, so your results will differ from mine. Here are my speculations:
- As a generalization, the more memory (RAM) a computer had, the more models ran successfully, and the better the quality of those responses.
- I still haven’t figured out why some tests failed sporadically, especially on the 32 GB Windows computer.
- My Windows computers ran tests a lot faster than my macOS computers. Across all tested models that produced at least one “good” result, the fastest computer’s average processing time was 16 times faster than the slowest’s.
- My Windows computers both have Nvidia GPUs, which probably helped a lot.
- It appears that, despite their relatively poor performance, the GPU features of the Apple M1 and M2 processors in my Macs were engaged in some processing. Newer Apple processors (M3, M4, and M5) might perform better. I don’t have a way to test this at this time.
- The Windows computers also have more RAM than the Macs, which is likely a confounding variable here. Both Macs had only 8 GB of RAM, and, due to the infinite wisdom of Apple, neither is upgradable.
The following table summarizes which models produced “good” results for which kinds of data on which operating systems.
Summary of Overall Results (Click to expand)
Both text and images
| Model | Percent | OS Text – OS Images |
|---|---|---|
| gemma3:12b | 100.0% | Both – Both |
| ministral-3:3b | 100.0% | Both – Both |
| ministral-3:8b | 87.5% | Both – Both |
| mistral-small3.2:24b | 77.8% | Both – Both |
| devstral-small-2:24b | 75.0% | Both – Both |
| qwen3-vl:8b | 75.0% | Both – Both |
| ministral-3:14b | 62.5% | Windows – Both |
| qwen3-vl:30b | 62.5% | Windows – Both |
| llama3.2-vision:11b | 55.6% | Mac – Both |
Text Only
| Model | Percent | OS |
|---|---|---|
| devstral:24b | 60.0% | Both |
| falcon3:10b | 60.0% | Both |
| marco-o1:7b | 60.0% | Both |
| mixtral:8x7b | 60.0% | Both |
| granite3.2:8b | 50.0% | Windows |
| falcon3:7b | 40.0% | Windows |
| granite3.3:8b | 40.0% | Mac |
| phi4-reasoning:14b | 40.0% | Windows |
| qwen3:4b | 40.0% | Both |
| qwen3:8b | 40.0% | Both |
| deepseek-r1:14b | 25.0% | Mac |
| magistral:24b | 25.0% | Windows |
| olmo-3.1:32b | 25.0% | Windows |
| olmo-3:32b | 25.0% | Windows |
| qwen2.5:7b | 25.0% | Windows |
| qwen3:14b | 25.0% | Windows |
| qwq:32b | 25.0% | Windows |
| gpt-oss:20b | 20.0% | Windows |
| granite3.1-dense:8b | 20.0% | Mac |
Images Only
| Model | Percent | OS |
|---|---|---|
| gemma3:4b | 100.0% | Both |
| granite3.2-vision:2b | 100.0% | Both |
| qwen3-vl:2b | 100.0% | Both |
| qwen3-vl:4b | 100.0% | Both |
| bakllava:7b | 75.0% | Both |
| llava-llama3:8b | 75.0% | Both |
| llava-phi3:3.8b | 75.0% | Both |
| llava:13b | 75.0% | Both |
| llava:7b | 75.0% | Both |
| minicpm-v:8b | 75.0% | Both |
| qwen2.5vl:3b | 50.0% | Windows |
| qwen2.5vl:7b | 50.0% | Windows |
Please note that these results do not compare the quality of different AI responses within the “Good” result category. Some “good” results were better than others.
This is, of course, all part of a rapidly changing landscape. Different models have different designs and capabilities. New Ollama models come out frequently. New chips are announced regularly. I can speed up AI processing on my slowest Mac by linking it to the Ollama server on my fastest Windows computer. This page only scratches the surface.
Evaluating AI Results – Impressions
I am currently in the process of analyzing the “Good” results from the models presented in the data tables above.
So far, I have coded the responses from ChatGPT 5 and 10 of the “good” Ollama results. The majority of the responses are pretty good, and several findings seem to be emerging so far:
- I had expected the ChatGPT response to be clearly superior to the Ollama model responses. This has not been the case. Many of the responses I’ve evaluated so far seem roughly equivalent. In particular, some Ollama models identified features of the test image that ChatGPT did not, such as the location where the photo was taken.
- The text responses are surprisingly different from each other. As I code these responses, I have found less code re-use across models than I expected. Different models seem to find different salient features to highlight.
- One way that a few models differ in text responses is that they are more likely to make mistakes. Some conflate different people. Some make statements that I don’t feel the data supports. I’ve seen several instances of attributing quotes to the wrong person, and even an instance or two of making up quotes out of thin air.
- With image summaries, some models identified the location of the image. Some were more specific, naming a city. A few mentioned incorrect locations or called the building a temple rather than an amphitheater.
These impressions are very preliminary. I am in no way approaching theoretical saturation at this time. And to be fair, I’ve cherry-picked the models most likely to produce the best results so far, so these findings may not hold up. At this time, I must focus on getting the release containing embedded AI published, so I am putting this analysis on hold indefinitely.
Asking AI to Evaluate AI
The task of evaluating and ranking the AI results produced by all of these tests, as described above, proved quite difficult and time-consuming. It is a task I have not had adequate time to complete as of this writing. Then it occurred to me that I could ask AI to handle this task.
Text
I created Quotes of the juror descriptions from the “Who Are the Jurors?” AI Summaries of transcripts described above. I then explored the resulting Collection using the following query:
The following are descriptions of the jurors in the movie “12 Angry Men” created by different AI models. Which 5 models do the best job? Please justify your response with quotes from the different descriptions.
I started with model qwen3-vl:8b and recorded the results from that model. I repeated the query with the most highly rated models from each of the summaries produced. I continued this process until the 10 top-ranked models converged on a consensus ranking of the “top” models. This produced the following rankings:
Models Ranked (by AI) for Description of Text (Click to expand)
| Rank | Model | Mentions | Points |
|---|---|---|---|
| 1 | ChatGPT 5 | 8 | 38 |
| 2 | ChatGPT 5.1 | 7 | 24 |
| 3 | ChatGPT 5.2 | 7 | 23 |
| 4 | qwen3:14b | 4 | 12 |
| 4 | qwen3-vl:8b | 4 | 12 |
| 6 | ministral-3:8b | 3 | 8 |
| 7 | devstral:24b | 4 | 7 |
| 8 | mistral-small3.2:24b | 2 | 6 |
| 9 | devstral-small-2:24b | 2 | 4 |
| 9 | gemma3:12b | 1 | 4 |
For our text prompt, ChatGPT produced the best results, according to our AI models. Interestingly, the models preferred ChatGPT 5 to the newer ChatGPT 5.1 and ChatGPT 5.2 models. Unfortunately, these ChatGPT models do not offer data security and privacy the way the remaining models do. qwen3:14b might be a good choice for Windows users, and qwen3-vl:8b, ministral-3:8b, and devstral:24b do reasonably well on both Windows and macOS.
Images
I then explored the still image descriptions, using the top 10 text models listed above. (That is, I used the best image models to generate text descriptions of the images, which were then evaluated using the best text models.) I used the following prompt:
The following are descriptions of a photo of Taormina, Italy created by different AI models. Which 5 models do the best job? Please justify your response with quotes from the different descriptions.
Models Ranked (by AI) for Description of Still Image (Click to expand)
| Rank | Model | Mentions | Points |
|---|---|---|---|
| 1 | gemma3:12b | 10 | 42 |
| 2 | ChatGPT 5 | 7 | 21 |
| 3 | qwen3-vl:30b | 7 | 17 |
| 4 | devstral-small-2:24b | 4 | 16 |
| 5 | mistral-small3.2:24b | 6 | 14 |
| 6 | gemma3:4b | 3 | 13 |
| 7 | qwen3-vl:8b | 3 | 9 |
| 8 | ministral-3:14b | 3 | 6 |
| 9 | ministral-3:8b | 2 | 4 |
| 10 | qwen2.5vl:3b | 2 | 3 |
For our image prompt, gemma3:12b is rated highest by our AI models, followed by ChatGPT 5, qwen3-vl:30b, and devstral-small-2:24b. All of these models work well on both Windows and macOS when exploring images. As with text exploration, ChatGPT 5 does not offer the same confidentiality protections the other models here do. I also find it interesting that the newer ChatGPT 5.1 and ChatGPT 5.2 models did not make the cut while the older ChatGPT 5 model did.
The Last Word
I want to emphasize one last point as part of this discussion. For both text and images, I asked 10 models to rank the AI results I had generated. Each model did so, assertively and confidently. And while some consensus emerged across the 10 models, these models disagreed far more than they agreed. No single model’s ranking matched the consensus, and no two models agreed with each other.
At one point, I accidentally ran one of the image tests twice. The selections and rankings of the two identical test runs were quite different, even though the Temperature setting used (0.3) should have limited the amount of randomness coming out of the AI model. So the AI models don’t even agree with themselves!
This reinforces in my mind that all AI results must be reviewed and checked against the data by a researcher before being reported as a research finding, and that all use of AI in qualitative analysis must be carefully described in qualitative write-ups and presentations. AI comments on qualitative data may be interesting, and they may sometimes suggest useful ideas. It’s vital to recognize that there is no actual, real “intelligence” behind AI, even when the clever application of high-level mathematics makes it appear so.
Large language models do not, cannot, and will not “understand” anything at all. They are not emotionally intelligent or smart in any meaningful or recognizably human sense of the word. LLMs are impressive probability gadgets that have been fed nearly the entire internet, and produce writing not by thinking but by making statistically informed guesses about which lexical item is likely to follow another.
What Happens When People Don’t Understand How AI Works by Tyler Austin Harper, The Atlantic, June, 2025.
