Using AI in Transana
tl;dr
This page starts with a comparison of External vs. Embedded AI options. It describes essential information about embedded models.
Next, the page lays out how I evaluated the numerous AI models available in Transana. I go into a lot of detail in this section so you can understand exactly how I reached my conclusions. Lessons Learned describes those conclusions.
The section called Having AI Evaluate and Rank AI provides rankings of the AI models available in Transana for exploring Text and for exploring Images. If you are here looking for guidance about what AI models to use, read this section.
But if you read nothing else, please read and re-read the last section, called The Last Word. Understanding that section is vital to all researchers who want to use AI as part of qualitative data analysis.
Introduction
To use AI well in Transana, researchers need to make several important decisions. This article highlights two major choices.
The choice between using external AI tools and embedded (internal) AI tools is fairly straightforward, and this article lays out the advantages and disadvantages of each option.
The choice of which AI model to use when exploring research data is more challenging. While external AI tools generally offer limited choices and limited flexibility, embedded AI offers an overwhelming set of choices. The bulk of this article is dedicated to laying out these choices, describing the process I used to narrow down the options, and sharing the results of my extensive testing of AI models.
External AI and Embedded AI
The first choice a researcher must make when using AI in Transana is between external AI and embedded (or internal) AI.
External AI uses an outside service such as ChatGPT, Claude, Gemini, or Copilot for AI queries. Currently, Transana supports ChatGPT and Claude for external exploration of data.
Embedded AI uses an AI service on a computer controlled by the researcher. Currently, Transana supports the Ollama system, which offers hundreds of individual AI models to choose from.
Each of these systems has advantages and disadvantages:
| External AI | Embedded AI |
|---|---|
| **Advantages:** Quality. Claude and ChatGPT 5 often offered excellent responses during my AI prompt tests. (See below.) They were less prone to making false statements than some of the other models I tested. | **Advantages:** Private, Secure. When properly configured, data is securely processed on a computer controlled by the researcher. No data is sent to external servers. |
| **Disadvantages:** Requires an account. External AI providers require user accounts. They want a credit card, and they track usage. | **Disadvantages:** Requires setup. The researcher must set up their own AI server. This is fairly easy, but requires installing an additional program. (See the Ollama Setup Instructions.) Researchers must also download the model(s) they want to use. |
| **Implementation:** Transana supports OpenAI’s ChatGPT and Anthropic’s Claude AI tools to provide external AI exploration. | **Implementation:** Transana uses the Ollama system to provide embedded AI exploration. |
Privacy and Confidentiality
Data privacy, security, and confidentiality are central issues for most research projects. When a researcher uses an external AI tool, they send their data to a server that is not under their control. A wide variety of company policies and legal issues influence what happens to the data once it is received by the external computer. It is imperative that researchers understand the privacy and confidentiality policies of the companies they work with for external AI exploration of their data. Researchers should never submit data for external AI exploration without the explicit approval of their Institutional Review Board or other ethics board overseeing their research.
With embedded AI processing in Transana, all AI processing is handled by the Ollama server selected by the researcher. This Ollama server may reside on their own computer, or they may configure Transana to use an Ollama server on a different computer under their control. When properly configured, the Ollama server does not share or retain any data. Researchers should only connect to an Ollama server on their own computer or one controlled by someone they trust, such as their department, university, or organization IT department. Choosing a server they control is how researchers ensure the privacy and confidentiality of their data during the AI exploration phase of their analysis.
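For example, one quick way to confirm which Ollama server you are talking to is to query its version endpoint. This is a minimal Python sketch, assuming a default Ollama installation listening on its standard port (11434); the host address is a placeholder you would replace with your own server's.

```python
import requests

# Replace with the address of an Ollama server you control. "localhost"
# keeps all processing on your own machine; Ollama listens on port 11434
# by default.
OLLAMA_HOST = "http://localhost:11434"

# A simple reachability check: /api/version answers if the server is up.
resp = requests.get(f"{OLLAMA_HOST}/api/version", timeout=5)
resp.raise_for_status()
print("Connected to Ollama", resp.json().get("version", "(unknown)"))
```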
Selecting Models
For external AI, Transana supports ChatGPT and Claude AI options. These tools recommend a small number of their newest models. If other external options, such as Copilot or Gemini, were available, evaluating and selecting an external model would take a bit more work. As it stands, external model selection is relatively simple compared to embedded model selection.
For embedded AI, there are many models to choose from, models that are generally not so well known, making this choice less obvious. The remainder of this article is devoted to the topic of selecting AI models for best results.
Embedded AI Models and Model Parameters
AI Models are algorithms that determine how an AI works with data. They are trained on (typically very large) data sets and are designed to handle certain types of tasks and achieve certain types of goals. Transana supports the external ChatGPT AI service from OpenAI and the external Claude AI service from Anthropic. It also supports an embedded AI tool called Ollama to manage the download, selection, and use of pre-defined AI models to allow the embedded exploration of qualitative data.
ChatGPT and Claude offer limited options for users, effectively hiding a lot of complexity from end users. Ollama presents a more complex AI landscape, requiring more background knowledge of the researcher.
Models
ChatGPT offers a small handful of models, mostly different versions of the same set of algorithms. Recent options (as of this writing) include gpt-5, gpt-5.1, gpt-5.2, and gpt-5.4.
Claude offers Claude-Haiku, Claude-Opus, and Claude-Sonnet models with differing levels of functionality and sophistication at different price points.
Ollama offers a large catalog of models. See the Models page on Ollama’s web site for more information.
Parameters
Ollama AI models are built with internal variables called parameters, which the model uses to map input data to outputs, influencing its ability to see patterns in data. To over-simplify: the more parameters a model is built with, the more sophisticated a response it should be able to generate. However, the number of parameters also affects AI processing factors such as memory requirements and processing speed.
Some Ollama models support several different parameter options, and we have determined that the parameters value is important in how models work within Transana. Models are always presented as “(model name):(parameters)” pairs, for example, “gemma3:12b” for the gemma3 model with the 12 billion parameter setting. We strongly recommend that Transana users avoid the use of “cloud” and “turbo” parameters, as these require external processing within Ollama and may compromise data confidentiality.
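As an illustration of the "(model name):(parameters)" naming, here is a minimal Python sketch that lists the models installed on a local Ollama server via its /api/tags endpoint and flags any "cloud" or "turbo" tags. This is a sketch under the assumption of a default local installation, not anything Transana itself does.

```python
import requests

OLLAMA_HOST = "http://localhost:11434"  # a local server, per the privacy advice above

# /api/tags lists the models installed on the server as "name:parameters" pairs.
resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=10)
resp.raise_for_status()

for model in resp.json().get("models", []):
    name = model["name"]                       # e.g. "gemma3:12b"
    base, _, tag = name.partition(":")         # split into (model name, parameters)
    size = model.get("details", {}).get("parameter_size", "?")
    # Flag tags that would route processing off the local machine.
    warning = "  <-- avoid: external processing" if tag in ("cloud", "turbo") else ""
    print(f"{base:<24}{tag:<10}{size}{warning}")
```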
You can explore the full list of models available with Ollama, along with their parameter options, through the “Models” section of the Ollama website.
Testing AI Models
(You can skip the detailed discussion of testing supported AI models by clicking here.)
Ollama offers a huge number of models. I tested as many of these models as I could in an effort to determine which models did a good job at exploring qualitative data within Transana. The next several sections of this article describe what I found. Please note that this is an ongoing process, and your results may vary.
AI Exploration of Text
For testing the analysis of text (including transcripts), I used a transcript of the movie “12 Angry Men” as my initial testing data because it was long enough (90 minutes) and complex enough (12 major speakers) to represent a potential challenge to AI, and because it is non-confidential data that others can obtain if they want to explore, replicate, or extend my test results.
While this does not represent typical qualitative research data, the narrative of the movie provides information that can be analyzed qualitatively and leads to clear conclusions, making between-model comparisons of AI results easier than if actual qualitative data had been used.
I ran the following prompt using each of about 145 Ollama models, 3 Claude models, and 4 ChatGPT models.
This is a transcript of a jury deliberation. Describe each juror in a separate paragraph, including juror number, name if known, occupation, and personality.
After a little experimentation, I settled on a context size of 32K, large enough to hold the 20K+ tokens of data I was exploring and a reasonable-sized response. I initially used a Temperature setting of 0.8, but have switched to a setting of 0.3 recently in an effort to reduce randomness and increase AI response consistency. I ran each test on multiple computers. Individual test queries could take anywhere from a few seconds to over 24 hours; any test taking longer than 24 hours was deemed a failure. I made the decision to skip some tests on slower computers that seemed unlikely to finish within this time frame for the sake of efficiency and my sanity.
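For readers who want to replicate this test protocol outside of Transana, the sketch below shows roughly what such a query looks like against Ollama's HTTP API, using the 32K context (num_ctx) and 0.3 temperature described above. This is a minimal illustration, not Transana's actual implementation; the model name and transcript file path are placeholders.

```python
import requests

OLLAMA_HOST = "http://localhost:11434"

# Placeholder path: the transcript being explored.
with open("12_angry_men_transcript.txt", encoding="utf-8") as f:
    transcript = f.read()

prompt = (
    "This is a transcript of a jury deliberation. Describe each juror in a "
    "separate paragraph, including juror number, name if known, occupation, "
    "and personality.\n\n" + transcript
)

payload = {
    "model": "gemma3:12b",        # placeholder; substitute the model under test
    "prompt": prompt,
    "stream": False,
    "options": {
        "num_ctx": 32768,         # 32K context: room for the 20K+ tokens of data
        "temperature": 0.3,       # low temperature to reduce run-to-run randomness
    },
}

# Slow models can take hours; tests over 24 hours were deemed failures.
resp = requests.post(f"{OLLAMA_HOST}/api/generate", json=payload, timeout=24 * 3600)
resp.raise_for_status()
print(resp.json()["response"])
```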
A summary of the results of this testing is presented in expandable tables below. The first table presents the model and parameter combinations that produced “good” results on at least one computer. The second table is a more detailed description of the computers used for testing.
Models for Exploring Text (Click to expand)
| Windows Models (1) | macOS Models (2) |
|---|---|

(1) The model ranked 16 does not work well on Windows.
(2) Models ranked 4, 6, 8, 11, 12, 13, 15, and 16 do not work well on my 8 GB macOS computers.
Additional Details About Testing Computers (Click to expand)
My testing has been conducted on four computers.
macOS testing was done on two computers:
- An M2-based Mac Mini with 8 GB of RAM. All Ollama models were stored on an external hard drive.
- An M1-based MacBook Pro with 8 GB of RAM.
Windows testing was conducted using two computers:
- An older desktop with 64 GB of RAM and a 6 GB Nvidia graphics card
- A newer laptop with 32 GB of RAM and an 8 GB Nvidia graphics card
Both computers are running Windows 11.
(If you are interested in helping with additional testing, please let me know through the Contact Form.)
There are several important points to keep in mind when reviewing these results.
- New models are coming out all the time. This is a snapshot of a moving target.
- I used a very broad prompt. In actual analysis, revising the prompt is an important step in AI exploration, and it is likely that changing the prompt will affect the AI output in both expected and unexpected ways.
- This is one example of a prompt and data. The data is from a movie, so it may differ from real-world research data in important ways.
- Both macOS computers used for testing had 8 GB of RAM. It’s possible that more RAM or newer processors would allow more models to run successfully.
AI Exploration of Images
For testing still image analysis by AI, I used a photograph I took a few years ago while traveling. The image included several distinct elements that could be included in the analysis. I submitted the following simple prompt:
Describe the following image:
Simply prompting for a description of the image revealed something very interesting. Some models responded to this prompt with a description, sometimes quite detailed, of an image that was clearly not the image I submitted. Thus, for images, this description prompt ends up revealing models that cannot process images the way Transana submits them, but that do not inform the researcher of this failure.
I settled on a context size of 48K for the image. For this test, I set the Temperature to 0.3, as I wanted to get more consistent results. I ran each test on multiple computers. As with the text tests, I stopped tests that ran over 24 hours, considering them a failure, and I made the decision to skip some tests that seemed unlikely to succeed within the 24 hour time frame for the sake of efficiency and my sanity.
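As with the text sketch above, here is roughly what such an image query looks like against Ollama's HTTP API, which accepts still images as base64-encoded strings. This is illustrative only, not how Transana itself submits images; the model name and image path are placeholders.

```python
import base64
import requests

OLLAMA_HOST = "http://localhost:11434"

# Placeholder path: Ollama's generate endpoint takes images as base64 strings.
with open("travel_photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "qwen3-vl:8b",       # placeholder; must be a vision-capable model
    "prompt": "Describe the following image:",
    "images": [image_b64],
    "stream": False,
    "options": {
        "num_ctx": 49152,         # the 48K context used in these tests
        "temperature": 0.3,       # lower temperature for more consistent results
    },
}

resp = requests.post(f"{OLLAMA_HOST}/api/generate", json=payload, timeout=24 * 3600)
resp.raise_for_status()
print(resp.json()["response"])
```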
The results of this testing are presented in an expandable table below. This table presents the model and parameter combinations that produced “good” results on at least one computer.
Models for Exploring Images (Click to expand)
| Windows Models | macOS Models |
|---|---|
Models that failed
The models listed below failed to produce adequate results for either text or image exploration.
Models That Failed AI Exploration (Click to expand)
| all-minilm:33m aya-expanse:8b aya:35b aya:8b bespoke-minicheck:7b bge-large:335m bge-m3:567m cogito:14b cogito:3b cogito:8b command-a:111b command-r7b-arabic:7b command-r7b:7b command-r:35b deepscalar-r:1.5b deepseek-llm:7b deepseek-v3:671b deepseek-r1:7b deepseek-v2.5:236b deepseek-v2:16b deepseek-v2:236b dolphin-llama3:8b dolphin-mistral:7b dolphin3:8b embeddinggemma:300m everythinglm:13b |
exaone-deep:7.8b exaone3.5:7.8b falcon3:3b firefunction-v2:70b gemma2:9b gemma3:1b gemma3n:e2b gemma3n:e4b gemma:7b glm4:9b granite-embedding:278m granite3-guardian:8b granite3.1-dense:2b granite3.1-moe:3b granite3.2:2b granite3.3:2b granite4:1b granite4:3b hermes3:3b hermes3:8b internlm2:7b lfm2.5-thinking:1.2b lfm2:24b llama2-uncensored:7b llama2:13b llama3.1:8b |
llama3.2:3b llama3.3:70b llama3:8b llama3:8b-instruct-q2_K llama4:16x17b mistral-large:123b mistral-nemo:12b mistral-small:22b mistral:7b mistrallite:7b nemotron-3-nano:30b nemotron-3-nano:4b nemotron-mini:4b nemotron:70b nuextract:3.8b olmo-3:7b olmo2:13b olmo2:7b openthinker:7b orca2:13b orca2:7b phi3.5:3.8b phi3:14b phi3:3.8b phi4-mini-reasoning:3.8b phi4-mini:3.8b |
phi4:14b qwen2.5:14b qwen2.5:3b qwen2:0.5b qwen2:7b qwen3-embedding:8b qwen:14b r1-1776:70b reflection:70b rnj-1:8b sailor2:8b smallthinker:3b smollm2:1.7b snowflake-arctic-embed:335m snowflake-arctic-embed2:568m solar-pro:22b stable-beluga:13b stablelm2:1.6b stablelm2:12b starling-lm:7b tinyllama:1b tulu3:70b tulu3:8b wizardlm2:7b yarn-llama2:13b zephyr:7b |
Summary and Lessons Learned
- I tested 150 combinations of models and model parameters for Ollama, Claude, and ChatGPT.
- Using 2 Windows computers, I tested all text analysis and image analysis for all 150 models for a total of 600 tests. (2 computers x 2 data types x 150 models = 600 tests.)
- Tests on macOS took a LOT longer than tests on Windows.
- Using 2 macOS computers, I ran 149 tests for text and 143 tests for images, for a total of 292 tests.
- Both Windows computers engaged GPUs on Nvidia graphics cards, and both macOS computers utilized the GPUs in their Apple processors for AI processing.
- (Total test run time was over 179 hours for Windows for 600 tests and over 632 hours on macOS for fewer than 300 tests.)
- AI Model matters.
- A significant percentage of Ollama models failed to produce reasonable results to these test queries. Of the 150 Ollama models tested, 41 produced “good” results on at least one test computer in our text tests and 36 produced at least one “good” result in our image tests.
- External models fared better in this respect. All Claude and ChatGPT models tested produced good results for both text and image data.
- When using the same prompt on the same data with different models, AI results can differ significantly. This is not surprising. However, with some prompts, even the same model will produce different results when run repeatedly.
- This suggests a challenging environment for qualitative researchers with confidential data. The task of picking a model or set of models for embedded AI data exploration can be a bit complicated.
- Hardware matters.
- Because I had only a few computers to test with, I can’t sort out all the factors. Your computer will likely differ from mine, so your results will differ from mine. The following is speculation.
- As a generalization, the more memory (RAM) a computer had, the more models ran successfully, and the better the quality of those responses.
- I still haven’t figured out why some tests failed sporadically, especially on the 32 GB Windows computer.
- My Windows computers ran tests a lot faster than my macOS computers.
- Newer Apple processors (M3, M4, and M5) might perform better. I don’t have a way to test this at this time.
- The Windows computers also had more RAM than the Macs, which is likely a confounding variable here. Both Macs had only 8 GB of RAM, and, due to the infinite wisdom of Apple, neither can be upgraded.
This is, of course, all part of a rapidly changing landscape. Different models have different designs and capabilities. New Ollama models come out frequently. New chips are announced regularly. I can speed up AI processing on my slowest Mac by linking it to the Ollama server on my fastest Windows computer. This page only scratches the surface.
Having AI Evaluate and Rank AI
The task of evaluating and ranking the AI results produced by all of these tests, as described above, proved quite difficult and time-consuming. It is a task that I have not had adequate time to complete as of the time of this writing. Then it occurred to me that I could ask AI to handle this task.
Text
I created Quotes of the juror descriptions from the “Who Are the Jurors?” AI Summaries of transcripts described above. I then explored the resulting Collection using the following query:
I started with the model ministral-3:8b, which has historically produced good results for me, and recorded the results from that model. I repeated the query with the most highly rated models in each of the summaries produced. I continued this process until I found a consensus of the “top” ranked models. These rankings are available in the following expandable table.
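The exact scheme behind the “Mentions” and “Points” columns in the table below is not spelled out here, but as an illustration, a consensus tally of this general shape could be computed with a simple Borda-style count, where each evaluator's top-ranked pick earns the most points. The evaluator names and rankings in this Python sketch are entirely hypothetical.

```python
from collections import defaultdict

# Hypothetical example: each evaluating model returns an ordered list of picks.
rankings = {
    "evaluator-a": ["claude-opus-4-6", "ministral-3:8b", "gemma4:31b"],
    "evaluator-b": ["claude-opus-4-6", "claude-sonnet-4-6", "ministral-3:8b"],
    "evaluator-c": ["claude-sonnet-4-6", "claude-opus-4-6", "qwq:32b"],
}

mentions = defaultdict(int)  # how many evaluators named the model at all
points = defaultdict(int)    # Borda-style score across evaluators

for ranked_list in rankings.values():
    for position, model in enumerate(ranked_list):
        mentions[model] += 1
        points[model] += len(ranked_list) - position  # 1st place scores highest

for model in sorted(points, key=points.get, reverse=True):
    print(f"{model:<20} mentions={mentions[model]} points={points[model]}")
```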
Models Ranked (by AI) for Description of Text (Click to expand)
| Rank | Model | Mentions | Points |
|---|---|---|---|
| 1 | claude-opus-4-6 | 5 | 198 |
| 2 | claude-sonnet-4-6 | 5 | 192 |
| 3 | ministral-3:8b | 5 | 169 |
| 4 | gpt-5.4 | 4 | 153 |
| 5 | gemma4:31b | 4 | 138 |
| 6 | claude-haiku-4-5 | 4 | 135 |
| 7 | devstral-small-2:24b | 4 | 128 |
| 8 | gemma4:26b | 4 | 126 |
| 9 | qwq:32b | 4 | 122 |
| 10 | gpt-5.1 | 3 | 120 |
| 11 | mistral-small3.2:24b | 4 | 119 |
| 12 | gpt-5.2 | 3 | 115 |
| 13 | qwen3-vl:30b | 4 | 114 |
| 14 | gpt-5 | 3 | 113 |
| 15 | magistral:24b | 4 | 112 |
| 16 | granite3.3:8b | 5 | 111 |
| 17 | olmo-3:32b | 4 | 107 |
| 18 | qwen3:14b | 4 | 103 |
| 19 | gemma4:e4b | 4 | 101 |
| 20 | mixtral:8x7b | 4 | 83 |
For non-confidential data, claude-opus is ranked best, with claude-sonnet coming in a close second. gpt-5.4 came in 4th. However, these models are likely not suitable for use with confidential or sensitive data.
For confidential data, Ollama’s ministral-3:8b model was ranked at number 3, followed by gemma4:31b (#5), devstral-small-2:24b (#7), gemma4:26b (#8), and qwq:32b (#9). Researchers whose computers have only 8 GB of RAM may struggle or find large models slow, and might look at ministral-3:8b (#3), granite3.3:8b (#16, Mac only), gemma4:e4b (#19), and ministral-3:3b (#21), which require less overall memory.
Images
I also explored the still image descriptions, using the top text models listed above. (That is, I used the best image models to generate text descriptions of the images, which were then evaluated using the best text models.) I used the following prompt:
The following are descriptions of a photo of Taormina, Italy created by different AI models. Which 5 models do the best job? Please justify your response with quotes from the different descriptions.
The results are available in the expandable table below:
Models Ranked (by AI) for Description of Still Image (Click to expand)
| Rank | Model | Mentions | Points |
|---|---|---|---|
| 1 | claude-opus-4-6 | 5 | 179 |
| 2 | claude-sonnet-4-6 | 5 | 176 |
| 3 | claude-haiku-4-5 | 5 | 170 |
| 4 | qwen3-vl:30b | 5 | 157 |
| 5 | gemma3:12b | 5 | 153 |
| 6 | gemma4:31b | 5 | 150 |
| 7 | qwen3-vl:8b | 5 | 138 |
| 8 | gemma4:26b | 5 | 128 |
| 9 | gemma4:e4b | 5 | 123 |
| 10 | gpt-5 | 5 | 118 |
| 11 | qwen3.5:9b | 5 | 114 |
| 12 | qwen3.5:27b | 5 | 110 |
| 13 | qwen3.5:4b | 5 | 106 |
| 14 | gemma4:e2b | 5 | 105 |
| 15 | ministral-3:14b | 5 | 105 |
| 16 | qwen3-vl:4b | 4 | 101 |
| 17 | gpt-5.1 | 5 | 99 |
| 18 | mistral-small3.2:24b | 4 | 93 |
| 19 | gpt-5.4 | 4 | 87 |
| 20 | gpt-5.2 | 4 | 80 |
For non-confidential images, the Claude models (opus, sonnet, and haiku) fared quite well, placed as the top three models by all AI models that assessed image summaries. gpt-5 was the best model from OpenAI, coming in at number 10, significantly higher than several newer gpt models.
For confidential images, qwen3-vl:30b (#4), gemma3:12b (#5), gemma4:31b (#6), and qwen3-vl:8b (#7) were highly ranked. For computers with 8 GB of RAM, gemma3:12b (#5), qwen3-vl:8b (#7), and gemma4:e4b (#9) are worth consideration, as they have lower memory requirements.
The Last Word
I want to emphasize one last point as part of this discussion. For both text and images, I asked several models to rank the AI results I had generated. Each model did so, assertively and confidently. And while a rough consensus emerged across the models, they disagreed far more than they agreed. No single model agreed fully with this consensus, and no two models agreed with each other.
At one point, I accidentally ran one of the image tests twice. The selections and rankings of the two identical test runs were quite different, even though the Temperature setting used (0.3) should have limited the amount of randomness coming out of the AI model. So the AI models don’t even agree with themselves!
This reinforces in my mind that all AI results must be reviewed and checked against the data by a researcher before being reported as a research finding, and that all use of AI in qualitative analysis must be carefully described in qualitative write-ups and presentations. AI comments on qualitative data may be interesting, and they may sometimes suggest useful ideas. It’s vital to recognize that there is no actual, real “intelligence” behind AI, even when the clever application of high-level mathematics makes it appear so.
Large language models do not, cannot, and will not “understand” anything at all. They are not emotionally intelligent or smart in any meaningful or recognizably human sense of the word. LLMs are impressive probability gadgets that have been fed nearly the entire internet, and produce writing not by thinking but by making statistically informed guesses about which lexical item is likely to follow another.
What Happens When People Don’t Understand How AI Works by Tyler Austin Harper, The Atlantic, June, 2025.