Chapter 4 Documentation & Text Editors

In this chapter we will cover a software tool used for data documentation. You will learn how to record the necessary information that will accompany your data, and how to use the tool to modify existing information.

4.1 Data Documentation

In the previous chapter you learned how to properly enter and format your research data. As stated before, this is only half of the picture; the other half is the documentation for your data. Documentation is the information needed to make the data understandable and usable. It describes the data’s origin, the variables and how they were recorded, the cleaning steps and changes made to the data, and the intended use of the data: it represents the context.

A good analogy to picture the spreadsheet data and the documentation is baking a cake:

The spreadsheet data represents the ingredients, the documentation is the recipe, and the cake is the output after analysis. You can’t conduct an analysis without data (you need the ingredients to bake a cake), and you need to understand what to do for an analysis and how to explain what you did (you need to know what ingredients to use and how to use them).

4.1.1 Importance

Throughout the chapters thus far, you have been taught to be proactive with organization and data entry. We will apply the same approach when setting up our documentation file(s). So far the goal has been to minimize the potential for personal errors; here the goal is to minimize the potential for errors or misinterpretation by others.

Think of it in terms of baking: Without the recipe, how would someone with an allergy know if the cake is safe for them to eat? What ingredients did you use? What did you do to avoid cross-contamination?

If you are still confused about why the context is important, here is a scenario commonly seen among ecology researchers:

You are a new master’s student conducting a research project on fish size and migration that will build on data collected in previous years for another student’s project on the same species. You were lucky to have the help of other members of the research lab for data collection. Now it has come time for analysis, and you have to combine the data you have collected with the data from the previous years. You have everything combined but can’t figure out why your results aren’t making sense, so you go to your supervisor for help.

You find out that the size data from previous years was recorded in inches, and some of yours appears to be a mix of cm and mm. You aren’t sure what is right, so you go back to the logbooks, but the entries for those dates don’t say how the lengths were recorded or who was there. Now you aren’t sure what to do: you can’t verify for certain who was there on each collection day, or whether they will remember exactly how they measured, so you can’t reliably convert the lengths to inches for your analysis.

We want to produce a research project that is reproducible, meaning something that can be replicated. To ensure information is effectively communicated, shared, and used, others need to be able to know exactly what you did.

4.1.2 Best Practices

It may seem tedious or unnecessary to include this information, but it is the best way to minimize the potential for errors. To ensure you are providing detailed documentation to accompany your research data, follow these best practices:

1. Data Dictionary

Include a data dictionary that will explain your variables in detail. This means the variable name, the variable type or class (e.g. character, numeric), a description of what the variable is (including units, if any), and all of the allowable values. Here is an example:

Variable    Class      Description                   Allowable Values
gear_type   character  gear used to collect sample   net, rod, trap
year        integer    year of collection            2022, 2023, 2024
kept        logical    sample was retained for use   Y, N
length_cm   numeric    length of fish in cm          number > 0
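
A data dictionary can also be used to check the data automatically. The following is a minimal Python sketch of that idea, not a method prescribed by this book: the names `DATA_DICTIONARY`, `is_positive_number`, and `check_row`, as well as the two sample rows, are hypothetical and exist only for illustration. Values are treated as text, as they would be when read from a plain text file.

```python
def is_positive_number(text):
    """Return True if text can be read as a number greater than 0."""
    try:
        return float(text) > 0
    except ValueError:
        return False


# Hypothetical data dictionary mirroring the example table above.
# Each variable maps to a rule that returns True when a value is allowed.
DATA_DICTIONARY = {
    "gear_type": lambda value: value in {"net", "rod", "trap"},
    "year": lambda value: value in {"2022", "2023", "2024"},
    "kept": lambda value: value in {"Y", "N"},
    "length_cm": is_positive_number,
}


def check_row(row):
    """Return a list of problems found in one row (a dict of variable -> text)."""
    problems = []
    for variable, is_allowed in DATA_DICTIONARY.items():
        if variable not in row:
            problems.append(f"{variable}: missing")
        elif not is_allowed(row[variable]):
            problems.append(f"{variable}: unexpected value {row[variable]!r}")
    return problems


# Made-up rows for illustration only.
good_row = {"gear_type": "net", "year": "2023", "kept": "Y", "length_cm": "12.5"}
bad_row = {"gear_type": "spear", "year": "2023", "kept": "Y", "length_cm": "-4"}

print(check_row(good_row))  # an empty list means no problems were found
print(check_row(bad_row))   # reports gear_type and length_cm
```

Writing the allowable values down once, in one place, means the same rules can serve both the human reader and an automated check.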

2. Source & Collection

Sometimes data comes from an external source. In that case, you need to indicate where the data comes from and provide a reference to the source.

If you collected the data yourself, you need to describe how it was collected and detail any processing steps taken (e.g. dissections or genetic sampling methods). Make sure to include dates throughout this description.

3. Format & Structure

Include details about each file, such as the file type, its relationships to other files, and any hierarchical structure within the data.

4. Transformations & Analysis

Make sure to write down the steps taken to clean or wrangle the data, and any information pertaining to uncertainties or errors within the data. You would also include any known bias, or limitations with the data here.

Although many peer reviewed journals do not require analysis files or publish files related to this, a great way to ensure detailed documentation is including the actual code used to conduct your analysis (more on this in later chapters).

5. Access & Permissions

You should specify who has access to the data, and what they are allowed to do with it. This is important for keeping the data secure and for giving credit where credit is due.

Include a data contact as a way to provide others a method of contacting you with questions regarding the data or its use.

4.1.3 Format

Documentation can be overwhelming, and it often isn’t a single file. It can be organized much like the files on your computer (e.g. all documentation files kept in one folder). Oftentimes a README file is used to explain what is in the other files within the folder or project, which is a great way to ensure documentation made up of multiple files is well understood.

The documentation files themselves can be in different formats but because the majority of documentation is explanations or text, we commonly see .txt files used for this. In the next section we will dive deeper into .txt files!

4.2 Text Files & Editors

Talking about text files and text editors opens up an entirely new world that can easily become overwhelming. You can do a lot with text files and text editors but we will only be focusing on their purpose for research data management.

4.2.1 Text or .txt

A text file is a file on your computer that contains text, but the type of file determines whether this is plain text or also contains special formatting (e.g. bold, italics) or images. Microsoft Word files (.doc or .docx) are examples of text files that contain formatting.

Some text files are opened using a word processor like Microsoft Word, which can render more complex text that contains fonts, styles, images, etc.

A .txt file is the most basic text file format. It holds simple text with little to no formatting, and it can be opened with a text editor.

4.2.2 Text Editor

A text editor is a software tool that you can use to create, view, or manipulate .txt files. Unlike Excel, a text editor applies no variable formatting or data types: it is just text.

There are different text editors that you can use (for example, Mac computers come with TextEdit and Windows computers come with Notepad), and it is entirely up to personal preference as to which editor you use.

A great text editor that is free to download and evaluate is Sublime Text (a license is required for continued use). If you don’t have it installed on your computer, you can download it from the Sublime Text website. Once on the download page, follow the instructions for your computer (i.e. select the link for the operating system you are using).

4.2.3 Applications

We can use text and .txt files throughout our data documentation process, as well as data storage and manipulation.

Because text files are made entirely of plain text, we can use the lack of formatting to our advantage and solve issues if and when they arise! A data file opened in a text editor shows one row of data on each line, with each cell separated by a delimiter. For example, a .csv file uses a comma as its delimiter. This means that on each line of the file, a , marks the start of the next value! You can use a different delimiter, but remember to choose something that won’t be found within the values themselves, as this can cause parsing errors.
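
To make the delimiter problem concrete, here is a short Python sketch. It is only an illustration under assumed data: the site name, gear, and length values are made up. It shows a value that contains a comma being split incorrectly, and two common ways around it. The second option, quoting values with Python's built-in `csv` module, goes slightly beyond what is described above.

```python
import csv
import io

# Made-up example: the site name "Lake Erie, north shore" contains a comma.
comma_line = "Lake Erie, north shore,net,12.5"

# Splitting on every comma yields 4 pieces instead of the intended 3 values.
print(comma_line.split(","))

# Option 1: use a delimiter that does not appear in the values, such as a tab.
tab_line = "Lake Erie, north shore\tnet\t12.5"
tab_values = tab_line.split("\t")
print(tab_values)

# Option 2: keep the comma delimiter but put quotes around values that contain it.
# The csv module adds the quotes when writing and removes them when reading.
buffer = io.StringIO()
csv.writer(buffer).writerow(["Lake Erie, north shore", "net", "12.5"])
quoted_line = buffer.getvalue()
print(quoted_line.strip())

parsed = next(csv.reader(io.StringIO(quoted_line)))
print(parsed)
```

Both options recover the three intended values; which one to use depends on what the software reading the file expects.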

Saving data as a .txt file helps avoid changes to the values caused by software-specific formatting.

Here are some examples of troubleshooting problems that can be solved using a text editor:

  1. Find and change/remove special characters that are not recognized by other software
  2. Compare 2 different data files
  3. Check out a file when you don’t have the software normally needed to open it
  4. Open/View data to avoid formatting changes with dates/other values
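
The first two tasks in this list can also be scripted. Below is a minimal Python sketch, offered as one possible approach rather than the way a text editor does it internally. The function names `find_special_characters` and `compare_lines` and the sample file contents are hypothetical; "special character" is assumed here to mean any character outside the basic ASCII set.

```python
import difflib


def find_special_characters(lines):
    """Return (line number, character) pairs for every non-ASCII character."""
    found = []
    for line_number, line in enumerate(lines, start=1):
        for character in line:
            if ord(character) > 127:
                found.append((line_number, character))
    return found


def compare_lines(old_lines, new_lines):
    """Return only the lines that differ, prefixed with '- ' (old) or '+ ' (new)."""
    diff = difflib.ndiff(old_lines, new_lines)
    return [entry for entry in diff if entry.startswith(("- ", "+ "))]


# Made-up file contents. In practice the lines could come from something like
# open(path, encoding="utf-8").read().splitlines(), where path is your own file.
old_version = ["site,length_cm", "A,12.5", "B,9.0"]
new_version = ["site,length_cm", "A,12.5", "B,9.0\u00a0"]  # hidden non-breaking space

print(find_special_characters(new_version))
print(compare_lines(old_version, new_version))
```

In this example the two versions look identical on screen, yet both checks point to line 3, where an invisible non-breaking space was added.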

4.3 Chapter Wrap-Up

Overall, context is everything. We need to know what you did, how you did it, when you did it, where you did it, and why you did it. You should now be able to take everything that you have learned about files on your computer, Excel for data entry, data documentation, and text files and approach your research projects more prepared and organized.

In the next few chapters, we will dive into what you need to know about a major component of your research projects: R. At first, we will start slow and ease you into how to install, set up, and navigate the software before getting into more complicated applied concepts.

What you have learned thus far will be used and referenced often, which is why it is important to understand. Make sure to refer back and refresh yourself when needed. It is a lot to take in!

4.3.1 Chapter Terms & Definitions

Here is a summary of some of the bolded terms used throughout this chapter. Refer back to this list whenever you need a refresher!

PUT TERM LIST HERE