A Python’s Guide to Columnizing with Text Files

In this tutorial, I will provide a step-by-step guide on the thinking process that went into the creation of a Python program that will allow a user to columnize a table made in Excel, as well as Google Sheets. This will be done on Visual Studio Code (VS Code), making use of text files to ensure that no external library is required for the algorithm, since the ability to interface with a text file has already been integrated into Python. To summarize, “columnizing” is a process where a column of a table with many repeated data is turned into many new columns, bringing with it another value that corresponded to the original column. For example, in a table containing the number of students graduating each year with a certain major, we have 3 fields: the name of the major, the year of graduation and the number of graduates. We can make the years of graduation (2001-2021) become 21 new columns labelled 2001 to 2021 with the number of graduates located in the corresponding locations within them. This technique allows for a new way to visualize a set of data in the form of a table for any disciplines that uses them (including Digital Arts & Humanities) and can lead to new discoveries regarding the data thanks to this change in perspective.

For those who are impatient, the full code is available at the bottom. However, I can’t guarantee that you will be able to understand all of it faster than if you read through my tutorial.

1. Create 3 new files in the same folder: one for the input table, one for the program itself and the last is for the output.

The first and last files can be the same file if desired, though it will lead to the input data being deleted so you should keep another copy of it safe somewhere else. For the tutorial, I have chosen the names “ColumnizeData.txt”, “ColumnizeProcess.py”, and “ColumnizeResult.txt” respectively, but it can be “a.txt”, “b.py”, and “c.txt” or anything you want.

2. Right click on the input (Data) file and output (Result) file and select one of the Copy Path.

This is to tell the algorithm (Process) where to get the data and where to write the processed data.

Right Click on the filename and choose either of the Copy Path options

3. Put the links into the following form, as well as create a list that will hold our eventual output.

data="D:\VS Code\Python\Projects\Academic\ColumnizeData.txt"
result="D:\VS Code\Python\Projects\Academic\ColumnizeResult.txt"
output=[]

The reason why it’s a list is because Python will read the text file as a list, so it’s just for keeping the same theme. To reference, here is a part of the dataset I plan to use. However, any dataset can be used here, as long as it’s a table and has titles as the first row.

4. Ask Process to read the Data like so:

file=open(data,"r")
items=file.readlines()
file.close()

It seems very simple, but we’ve obtained the dataset we need.

5. Select from the dataset headers:

Identifiers for the data (indexField)
Which field we want to columnize (columnizedField)
Which values are carried to the new columns (valueField)
The separators used (separators)

This needs to be done manually since the user will choose which column is used for which job. In this case, I will columnize Class of Year and carry over the Total number of graduates. Due to the repeating nature of the data, in this case being each subject repeating once per year, the index (indexField) would also be needed. Finally, we need to be able to separate each read line into individual words to we can run our program, so the separators translated from Sheets to VS Code also need to be identified. In most cases this is a tab (“\t” in Python) but to make sure, you can print the items obtained to see it.

indexField="CIP Code" # Identify the entries that will gain new fields
columnizedField="Class of Year" # The field that becomes new column headers
valueField="Total" # The field with values that will be carried across the new columns
separator="\t" # What Python consider the separator to be when put into a text file

The arrows signifying tabs appearing when highlighted

By typing “print(items)” into the program and running it, we can see the separator as seen by VS Code applied to the headers as the first item in the ‘items’ list:

The headers with \t separators

6. Use the .split() method to turn each line into words and .strip() to remove the ‘\n’ at the end.

for i in range(len(items)): items[i]=items[i].strip().split(separator)

The reason to remove the ‘\n’ is because it will ruin the final output by adding in new lines that we don’t want.

The line above will make each item become a new list:

Transformation of the items

7. Change the fields into index numbers.

The identifying variables must be indices corresponding to each piece of data in each item to allow the Process to reference each field of data. We can make use of the fact that the first item contains the headers, hence having the same content as indexField, columnizedField and valueField like this:

indexField = items[0].index(indexField)
columnizedField = items[0].index(columnizedField)
valueField = items[0].index(valueField)

This will automatically search for the corresponding field indices, so make sure you have used the exact same name initially as contained in the header.

8. Create 2 new dictionaries and the supporting variables.

As we are iterating through the text file line by line, new information will need to be assigned to old entries. With my dataset as an example, this means we need to identify where the later Total values will be placed. There are 2 things to consider: keeping the value to the same subject and keeping the value to the same year. However, since most of the time those fields won’t have rounded numbers like 1, 2 and 3, we need to create a few dictionaries to help us translate indices and year numbers to rounded indices:

indexInResult = {} #e.g. 3.0103:0 (its index as an entry in the output)
indexCounter = 1 #To assign the indexInResult values
newColumns = {} #e.g. 2021:4 (its index as a field in the output)
entryLength = len(items[0])-2 #To assign the location of each new column

entryLength is the length of each entry in the output before adding in new columns, but due to how Python indexing works, this also gives the next empty column that we can put the columnized years into. The columnizedField and valueField will be removed, hence the -2 at the end.

9. Next, enter in all the unique years that will be columnized into the newColumns dictionary, as well as track how long each new entry (aka line) of the output is.

for i in range(1,len(items)):
    if items[i][columnizedField] not in newColumns:
        newColumns[items[i][columnizedField]] = entryLength
        entryLength+=1 #To assign any new columns

10. Add in the header of the output. It’s very different from the rest so it needs to be done separately.

# First line of "output"
newEntry=[]
for data in range(len(items[0])):
    if data!=columnizedField and data!=valueField:
        newEntry.append(items[0][data])
for eachColumn in newColumns:
    newEntry.append(eachColumn)
output.append(newEntry)

After all of that, let’s finally columnize the data.

11. Columnize the data

a. Start the process by shortening a few variables.

for entry in range(1,len(items)):
    index = items[entry][indexField] #e.g. 3.0103
    column = items[entry][columnizedField] #e.g. 2021
    value = items[entry][valueField] #e.g. 15

As we iterate, there are 2 possible cases for each line: either a line has a subject we haven’t seen yet, or we have.

b. For the first case, we need to find where it will go by filling a new entry into the indexInResult dictionary, create a line for the output with enough space for possible future Total values, and put the line into output.

if index not in indexInResult:
        #Where it is in "output"
        indexInResult[index]=indexCounter
        location=indexInResult[index]
        indexCounter+=1
        
        #What is put into "output"
        newEntry=[] 
        for data in range(len(items[entry])): 
            # Move the untouched fields over
            if data!=columnizedField and data!=valueField:
                newEntry.append(items[entry][data])
        newEntry.extend([""]*(entryLength-len(newEntry))) 
        # Extend the entry to allow for all values to be moved correctly
        
        #Put it into "output"
        output.append(newEntry)

c. In the other case, if we have already seen the subject, we just need to find where to put it by consulting the indexInResult dictionary.

    elif index in indexInResult:
        location=indexInResult[index]

d. Finally, the most important part and why we are making this program in the first place: putting the data into its rightful place.

    output[location][newColumns[column]] = value

12. Rewrite the data

This ends the For loop, but before we can obtain the result, we must remember something: the output is currently a bunch of lists within a bigger list but when we first read the items, it was just a bunch of strings with ‘\t’ separators in a list, so we need to change it back so VS Code can actually give us something we can copy paste directly into Excel and/or Google Sheets.

for entry in range(len(output)):
    newForm="" # So that when pasted back to the original location the separators are retained
    for data in range(len(output[entry])-1):
        newForm+=output[entry][data]+separator
        # The -1 in the for loop is to prevent the separator being added to the very end of each entry
    newForm+=output[entry][-1]+"\n" # Create newline when writing back to text file
    
    output[entry]=newForm

At at the very end, we write down the output:

file=open(result,"w")
for entry in output: file.write(entry)
file.close()

Once again, it is very simple but less than when we read the Data. This is because VS Code can’t write a list into a text file, only strings.

The Full Code

data="D:\VS Code\Python\Projects\Academic\ColumnizeData.txt"
result="D:\VS Code\Python\Projects\Academic\ColumnizeResult.txt"
output=[]

file=open(data,"r")
items=file.readlines()
file.close()

indexField="CIP Code" # Identify the entries that will gain new fields
columnizedField="Class of Year" # The field that becomes new column headers
valueField="Total" # The field with values that will be carried across the new columns
separator="\t" # What Python consider the separator to be when put into a text file

for i in range(len(items)): items[i]=items[i].strip().split(separator)

indexField=items[0].index(indexField)
columnizedField=items[0].index(columnizedField)
valueField=items[0].index(valueField)

indexInResult={} #e.g. 3.0103:0 (its index as an entry in the output)
indexCounter=1 #To assign the indexInResult values
newColumns={} #e.g. 2021:4 (its index as a field in the output)
entryLength=len(items[0])-2 
#The length of each entry in the "output", but we'll delete the columnizedField and valueField so -2
#To assign the location of each new column

for i in range(1,len(items)):
    if items[i][columnizedField] not in newColumns:
        newColumns[items[i][columnizedField]]=entryLength
        entryLength+=1 #To assign any new columns

# First line of "output"
newEntry=[]
for data in range(len(items[0])):
    if data!=columnizedField and data!=valueField:
        newEntry.append(items[0][data])
for eachColumn in newColumns:
    newEntry.append(eachColumn)
output.append(newEntry)
        

for entry in range(1,len(items)):
    index=items[entry][indexField] #e.g. 3.0103
    column=items[entry][columnizedField] #e.g. 2021
    value=items[entry][valueField] #e.g. 15
    
    
    if index not in indexInResult:
        #Where it is in "output"
        indexInResult[index]=indexCounter
        location=indexInResult[index]
        indexCounter+=1
        
        #What is put into "output"
        newEntry=[] 
        for data in range(len(items[entry])): 
            # Move the untouched fields over
            if data!=columnizedField and data!=valueField:
                newEntry.append(items[entry][data])
        newEntry.extend([""]*(entryLength-len(newEntry))) 
        # Extend the entry to allow for all values to be moved correctly
        
        #Put it into "output"
        output.append(newEntry)
        
    elif index in indexInResult:
        location=indexInResult[index]
    
    output[location][newColumns[column]] = value

for entry in range(len(output)):
    newForm="" # So that when pasted back to the original location the separators are retained
    for data in range(len(output[entry])-1):
        newForm+=output[entry][data]+separator
        # The -1 in the for loop is to prevent the separator being added to the very end of each entry
    newForm+=output[entry][-1]+"\n" # Create newline when writing back to text file
    
    output[entry]=newForm

file=open(result,"w")
for entry in output: file.write(entry)
file.close()

For more general help regarding Python, you can check out this online course, or perhaps you would be more interested in an external library dedicated to interacting with Excel using Python?

Tags: Digital Humanities, python, Tutorial

1 thought on “A Python’s Guide to Columnizing with Text Files”

Parker says:

March 6, 2023 at 8:52 pm

When you break it down as such, this made me realize how much actually goes in to columnizing. Admittedly, code (that isn’t R) seems daunting to me, and when I scroll to the bottom of the tutorial and see all of the code at once, I feel that same daunting feeling, but you do a great job of breaking down the pieces into digestible pieces. This tutorial also reminds me of the debate we had earlier about whether or not humanities students should learn how to code. Good work!

A Python’s Guide to Columnizing with Text Files

1. Create 3 new files in the same folder: one for the input table, one for the program itself and the last is for the output.

2. Right click on the input (Data) file and output (Result) file and select one of the Copy Path.

3. Put the links into the following form, as well as create a list that will hold our eventual output.

4. Ask Process to read the Data like so:

5. Select from the dataset headers:

6. Use the .split() method to turn each line into words and .strip() to remove the ‘\n’ at the end.

7. Change the fields into index numbers.

8. Create 2 new dictionaries and the supporting variables.

9. Next, enter in all the unique years that will be columnized into the newColumns dictionary, as well as track how long each new entry (aka line) of the output is.

10. Add in the header of the output. It’s very different from the rest so it needs to be done separately.

11. Columnize the data

12. Rewrite the data

The Full Code

1 thought on “A Python’s Guide to Columnizing with Text Files”

Leave a Reply Cancel reply

Final Project Update: Graduate School Mapping

Thorpe Pool 3D Tour Update

Analysis of Linked Jazz

Swiss Cheese 3D Model

Carleton’s major Majors Final Project Update

FINAL PROJECT: Carletonian Corpus

Week 9 Blog – Tutorial

Week 4 – Network analysis DH project reflection

Alumni Visualization: A Display of Carleton’s English Majors

Carleton’s Major Majors Website

Data Visualization Update: 3D Thorpe Pool

Data Visualization: Graduate School Mapping

Final Data Visualization

Popularity of Carleton Majors by Year – An Update

Exploring ArcGIS Mapping

Hacking the Humanities 2023W