Working with PDF Data

The PDF File Format

PDFs are hybrid documents. They can include text, which researchers may wish to treat like text from an analytic point of view. They can include images and elements of visual layout and presentation, which researchers may wish to treat like still images from an analytic point of view. Often, layout matters. Because of this, PDFs are not editable the way text files usually are.

It is helpful to realize that the PDF format is more concerned with consistent visual presentation than other text document formats are. This has two implications for qualitative analysis.

First, PDF documents have no coherent concept of order. Elements are stored in the order they were created, regardless of where they appear on the visual page. Because of this, they cannot be used as Transcripts associated with media files in Transana, which require a meaningful sense of linear order.
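To illustrate the ordering problem, here is a minimal sketch in Python. It does not read a real PDF and is not how Transana works internally. The TextElement class, the coordinates, and the sample page are all hypothetical, and the y value is assumed to be measured from the top of the page. It only shows that the order in which elements are stored can differ from the order a reader sees.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    """A hypothetical text element on a page (not a real PDF structure)."""
    text: str
    x: float  # assumed: distance from the left edge of the page
    y: float  # assumed: distance from the top edge of the page


# Hypothetical page, listed in the order the elements were created.
stored = [
    TextElement("Footer note", 50, 700),
    TextElement("Title", 50, 40),
    TextElement("Second paragraph", 50, 300),
    TextElement("First paragraph", 50, 120),
]


def storage_order(elements):
    """Return the text in the order the elements are stored."""
    return [e.text for e in elements]


def visual_order(elements):
    """Return the text sorted top-to-bottom, then left-to-right."""
    return [e.text for e in sorted(elements, key=lambda e: (e.y, e.x))]


print(storage_order(stored))
print(visual_order(stored))
```

Sorting by position is only a heuristic. Multi-column layouts, sidebars, and captions can defeat it, which is one reason a PDF cannot stand in for a linear Transcript.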

Second, just because something appears to be text in a PDF doesn’t mean it actually IS text. PDFs can sometimes present “letter-shaped drawings” which look like text but are unrecognizable as text to Transana. As a result, Transana is not always able to process as text what may appear to the user to be text. This is determined by the program that created the PDF, and there’s little Transana can do about it.
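The difference between real text and letter-shaped drawings can be sketched the same way. The PageElement class and the two sample pages below are hypothetical and stand in for whatever the creating program actually wrote into the PDF. The point is only that a text extractor can return the characters of text elements, while drawings yield nothing even when they look like letters on screen.

```python
from dataclasses import dataclass


@dataclass
class PageElement:
    """A hypothetical page element (not a real PDF structure)."""
    kind: str     # "text" for real text, "drawing" for vector shapes
    content: str  # the characters for text; a description for drawings


def extract_text(elements):
    """Join the content of text elements; drawings contribute nothing."""
    return " ".join(e.content for e in elements if e.kind == "text")


# A page whose words are stored as real text.
real_page = [
    PageElement("text", "Interview"),
    PageElement("text", "notes"),
]

# A page that looks the same on screen but stores only outlines.
outlined_page = [
    PageElement("drawing", "shape resembling 'Interview'"),
    PageElement("drawing", "shape resembling 'notes'"),
]

print(repr(extract_text(real_page)))
print(repr(extract_text(outlined_page)))
```

An empty result for a selection that visibly contains words is the situation in which the text has to be supplied by hand.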

Analyzing PDF Data

In Transana, you have two options for analyzing PDF data. The differences, while subtle, are important to understand.

PDF Quotes

PDF Quotes are text-based selections made from a PDF. When you open a PDF Quote, Transana shows you the portion of the document you selected with the Text Rectangle coding shape you defined. However, PDF Quotes are treated as text, independent of the PDF source, in Transana’s reports. The text is available to the Word Frequency Report and to Text Searches.

(Please note that if your text selection covers letter-shaped graphics rather than actual text in the PDF Document, you may need to enter the text into the PDF Quote manually during the PDF Quote creation process.)

PDF Snapshots

PDF Snapshots are image-based selections made from a PDF. PDF Snapshots are treated as still images, even if they contain text elements. They are displayed as graphics in Transana’s reports, and have no awareness of any letters they may contain.

PDF Quotes and Snapshots in Text Reports

As you can see below, a PDF Quote is presented as text in Transana’s text-based reports. Only the extracted text is displayed.

PDF Snapshots are treated as image files. The framing of the PDF in the PDF Window when the PDF Snapshot was last saved determines the visual appearance of the PDF Snapshot in Transana’s text-based reports. If too much of the PDF is displayed, you may want to edit the PDF Snapshot to zoom in and re-frame it.