Week 11: Data preprocessing - Tidying/Cleaning data
This tutorial is illustrated with data from ETtoday!
1 What does “cleaning” mean?
1.1 Why the need to clean the data?
We just scraped data from the Internet. Even if we were very careful when writing the scraping code, there will surely be unnecessary data to remove, format problems to fix, or further annotations to add. The reasons are quite simple: (a) while the majority of the webpages we scraped have a consistent structure, it is possible that some were slightly different, (b) unnecessary parts were coded under the same HTML nodes, such that it was not possible to scrape the data we wanted without scraping them as well, or (c) it was simply easier to write the scraping code that way.
That’s why we need to go through the cleaning process before rushing into the analyses! The main problems we will encounter are:
The format of the table is not suited for analyses, so we need to rearrange it
Unnecessary data need to be removed, otherwise they will affect the results
Simple annotations can be made, and these will be helpful for both the cleaning and analyses processes
1.2 Some cleaning principles
We clean, we don’t fish
The goal of the cleaning process is to get the data set ready for further analyses. But there are two important points we need to keep in mind:
We might over-clean: while removing unnecessary data, we may remove much more than we intended. That's why we need to check, at each step, that we are not removing too much
We might be biased. We have a research question in mind, and maybe some ideas about what the results may look like. Sometimes, we are so engaged in our study that we assume the results before conducting the analyses… and in that case, we may be so biased that we treat counterexamples as errors to remove before the analyses. This is completely wrong: the cleaning process should be blind!!
We must be aware of the final format of the data set
Especially in R, the most common format of the data set is “1 row = 1 data point”. A “data point” may be one whole text, one paragraph, one sentence, etc. Then, each column of that row is related to the same data point. In other words, it is not ideal to have the same data point on two different rows! Transforming the data towards that direction is also part of the cleaning process.
Example of a data set with a suitable format (fake data):
| Date | Article | Sentence | … (other data) | Link |
|---|---|---|---|---|
| 2024-01-01 | Article 1 | Sentence 1 | … | Link 1 |
| 2024-01-01 | Article 1 | Sentence 2 | … | Link 1 |
| 2024-01-01 | Article 1 | Sentence 3 | … | Link 1 |
| 2024-01-03 | Article 2 | Sentence 1 | … | Link 2 |
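As a minimal sketch in R (with purely invented values mirroring the fake table above), such a data set could be built like this:

## Hypothetical illustration: one row per sentence, and every column describes that same sentence
tidy_example <- data.frame(
  date     = c("2024-01-01", "2024-01-01", "2024-01-01", "2024-01-03"),
  article  = c("Article 1", "Article 1", "Article 1", "Article 2"),
  sentence = c("Sentence 1", "Sentence 2", "Sentence 3", "Sentence 1"),
  link     = c("Link 1", "Link 1", "Link 1", "Link 2")
)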
We must be close to our data
There are at least two reasons for that:
The closer you are to your data, the more you know what to do to clean them
Not all the cleaning steps are necessary; it really depends on your data set!
1.3 A proposed workflow
Please do not hesitate to zoom in!
2 Preparation for the cleaning process: Example with the ETtoday corpus
2.1 Playing hide-and-seek: The goal of the cleaning process
As the workflow makes evident, the cleaning process is a little bit like playing hide-and-seek. We must find what needs to be cleaned, and one way is to scroll through the data at random.
The reality is that we will always miss something at some point. Most of the time, we realize that we need more cleaning during the analysis process. At that moment, we just add another step… and this is why having scripts already prepared is essential, since implementing a new step is easy and not time-consuming!
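One simple way to "scroll randomly" is to read a random sample of rows each time. A minimal sketch with dplyr (where scraped_data is a placeholder name for your scraped data frame):

library(dplyr)
## Pull 20 random rows to inspect by eye; rerun to get a fresh sample
slice_sample(scraped_data, n = 20)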
2.2 List of the steps to clean the ETtoday corpus
Here is the list of the steps to go through to clean the ETtoday corpus based on my observations:
Transformations:
- The whole article appears in one cell. We need to transform the data such that one cell = one paragraph
Annotations:
- Annotate based on year, month and day
Data removal:
- Remove the legends of the images
- Remove messages from ETtoday unrelated to the article (for instance: "be cautious with your alcohol consumption", etc.)
- Remove rows corresponding to the identity of the journalist
- Remove data from 2023
- Remove empty rows
- Remove weird scraped instances (e.g., HTML language, etc.)
3 Now let's turn to R coding
Now that we are clear about what we need to clean, let's have R do the job for us!
First, you can download the script here, and follow the steps below to understand how it works.
[Disclaimer: The explanation of the R code was generated by Gemini, as an illustration of how such tools can be used to decode a script.]
3.1 Prepare the environment
This section sets up the workspace by loading necessary tools and the raw data.
library(dplyr)
library(tidyr)
library(openxlsx)

Load the libraries:
library(dplyr) and library(tidyr): These are loaded for data manipulation. dplyr is used for filtering and mutating data, while tidyr is specifically used here to reshape the data structure (unnesting paragraphs).
library(openxlsx): Loaded to handle Excel file export at the end of the script.
Load the originally scraped data:
load(file = "ArticleETToday_CorpusCourse.Rdata")

The command load(file = "ArticleETToday_CorpusCourse.Rdata") imports a previously saved R workspace file containing the raw scraped articles.
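Before cleaning, it can be useful to check what was actually restored. A small sketch, assuming the loaded object is the data frame Article_total with a body column, as in the rest of the script:

## List the objects restored from the .Rdata file
ls()
## Inspect the structure of the scraped data: column names, types, and number of rows
str(Article_total)
## Preview the first few article bodies
head(Article_total$body, 3)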
3.2 Examples of cleaning/preparing processes
This is the core of the script, where the raw text is processed into a clean format suitable for analysis.
Further annotations
The code extracts temporal metadata from a single string column.
Article_total$year <- substr(Article_total$time, start = 1, stop = 4)
Article_total$month <- substr(Article_total$time, start = 6, stop = 7)
Article_total$day <- substr(Article_total$time, start = 9, stop = 10)

It creates three new columns (year, month, day) by cutting specific parts of the time column with the substr (substring) function. For example, it assumes that the first 4 characters of time represent the year.
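A quick sanity check (a sketch, not part of the original script) is to tabulate the new columns and make sure the values look like plausible dates; anything odd would point to a time column with a different format:

## All values should be plausible years and months
table(Article_total$year)
table(Article_total$month)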
Data transformation
Split the articles into paragraphs
This is a structural transformation step.
Logic: The original data has one row per article. This code changes the unit of analysis to one row per paragraph.
Article_total$original_article <- Article_total$body
Article_total2 <- Article_total %>%
mutate(body = strsplit(as.character(body), "\r\n")) %>%
unnest(body)

Code: It splits the body text wherever it finds a line break ("\r\n"). It then uses unnest(body) to expand these split pieces into individual rows.
Result: If an article had 5 paragraphs, it will now occupy 5 rows in the dataframe Article_total2.
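To verify that nothing was lost in the reshaping, one can compare a few counts before and after; a sketch (assuming the columns created above):

## Number of article-level rows before splitting
nrow(Article_total)
## Number of distinct original articles after splitting (should match, barring duplicated articles)
length(unique(Article_total2$original_article))
## The paragraph-level data frame should have many more rows
nrow(Article_total2)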
Remove unwanted paragraphs
This section acts as a filter to remove “noise”—text that is not part of the actual news report.
Example 1 (Image Legends): Removes rows containing specific symbols like “▲” or “▼”, and lines starting with “圖/” (Image) or “文/” (Text/Author), which usually indicate captions or credits.
Article_total2 <- Article_total2[-grep("▲", Article_total2$body),]
Article_total2 <- Article_total2[-grep("▼", Article_total2$body),]
### Most of them start with the "圖/" (image) and "文/" (text) source: Check first if that's the case
test <- Article_total2[+grep("圖/", Article_total2$body),]
test <- Article_total2[+grep("文/", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("圖/", Article_total2$body),]
Article_total2 <- Article_total2[-grep("文/", Article_total2$body),]

Example 2 (Boilerplate/System Messages): Removes standard ETtoday interface text and links.
### Most of them start with the "●" symbol: Check first if that's the case
test <- Article_total2[+grep("●", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("●", Article_total2$body),]
### Some start with the "►" and "▸" symbol (link to other articles): Check first if that's the case
test <- Article_total2[+grep("►", Article_total2$body),]
test <- Article_total2[+grep("▸", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("►", Article_total2$body),]
Article_total2 <- Article_total2[-grep("▸", Article_total2$body),]
### Most of them start with the "★" symbol: Check first if that's the case
test <- Article_total2[+grep("★", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("★", Article_total2$body),]
### Some rows are just made of the message "【其他新聞】" (other news): Check first if that's the case
test <- Article_total2[+grep("【其他新聞】", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("【其他新聞】", Article_total2$body),]
### Some rows are just made of the message "更多新聞" (more news): Check first if that's the case
test <- Article_total2[+grep("更多新聞", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("更多新聞", Article_total2$body),]
### Some rows are just made of the message "延伸閱讀" (read more): Check first if that's the case
test <- Article_total2[+grep("延伸閱讀", Article_total2$body),]
### Indeed the case: Can be removed
Article_total2 <- Article_total2[-grep("延伸閱讀", Article_total2$body),]

It filters out symbols like “●”, “►”, “▸”, and “★”.
It removes navigational phrases like “【其他新聞】” (Other news), “更多新聞” (More news), and “延伸閱讀” (Extended reading).
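One caveat worth keeping in mind (not mentioned in the script, but standard R behaviour): if grep() finds no match it returns integer(0), and subsetting with data[-integer(0), ] drops every row. A safer equivalent filter uses grepl(), which always returns one TRUE/FALSE per row, so nothing happens when there is no match; a sketch:

## Safe removal: rows matching the pattern are dropped, and the data frame is untouched if there is no match
Article_total2 <- Article_total2[!grepl("★", Article_total2$body), ]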
Example 3 (Journalist Identity):
It creates a temporary column FirstTwoCharacters to check the start of the paragraph.
### Most of them start with the two characters "記者" (journalist)
test <- Article_total2
test$FirstTwoCharacters <- substr(Article_total2$body, start = 1, stop = 2)
## Only 178 sentences out of 361154 will be wrongly removed, quite acceptable
Article_total2$FirstTwoCharacters <- substr(Article_total2$body, start = 1, stop = 2)
Article_total2 <- Article_total2[-grep("記者", Article_total2$FirstTwoCharacters),]
Article_total2$FirstTwoCharacters <- NULL
### Some start with the four characters "實習記者" (journalist-internship)
test <- Article_total2
test$FirstFourCharacters <- substr(Article_total2$body, start = 1, stop = 4)
test <- test[+grep("實習記者", test$FirstFourCharacters),]
## Indeed all rows need to be removed
Article_total2$FirstFourCharacters <- substr(Article_total2$body, start = 1, stop = 4)
Article_total2 <- Article_total2[-grep("實習記者", Article_total2$FirstFourCharacters),]
Article_total2$FirstFourCharacters <- NULL

It removes rows starting with “記者” (Reporter) or “實習記者” (Intern Reporter) to strip out bylines.
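Since the check only concerns the first characters of each paragraph, base R's startsWith() offers a slightly more direct formulation of the same test; a sketch, not the original code:

## Flag paragraphs whose first characters are "記者" or "實習記者" (bylines), then drop them
is_byline <- startsWith(Article_total2$body, "記者") | startsWith(Article_total2$body, "實習記者")
Article_total2 <- Article_total2[!is_byline, ]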
Example 4 (Date Filtering):
Article_total2 <- Article_total2[!(Article_total2$year=="2023"), ]

It removes all rows where the year is “2023”. The comment notes this is “controversial,” possibly because it indiscriminately drops data based on date rather than content quality.
Remove empty rows

This section deals with “invisible” noise.
Iterative Cleaning: The user manually identified various forms of empty space (empty strings "", single spaces " ", runs of multiple spaces, and specific whitespace characters found at particular row indices like 29 or 13974).
Process: The code repeatedly checks for these specific empty patterns and removes them to ensure the final dataset contains only text with content.
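A more compact alternative (a sketch, not how the original script proceeds) is to flag every row made only of whitespace in one pass; the Unicode class \p{Z} should also catch exotic space separators (no-break space, full-width space) that a plain \s can miss:

## Flag rows containing nothing but (possibly exotic) whitespace characters, then drop them in one step
only_space <- grepl("^[\\s\\p{Z}]*$", Article_total2$body, perl = TRUE)
Article_total2 <- Article_total2[!only_space, ]

The original script instead proceeds step by step, which has the advantage of showing exactly what is being removed at each point: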
## Test the code to make sure we are not removing too much
test <- Article_total2[(Article_total2$body==""), ]
## Indeed all the rows correspond to empty paragraphs
Article_total2 <- Article_total2[!(Article_total2$body==""), ]
## Visual inspection: Still empty rows, such as line number 29
Article_total2$body[29] ## corresponds to a space
## Test the code to make sure we are not removing too much
test <- Article_total2[(Article_total2$body==" "), ]
## Indeed all the rows correspond to paragraphs with a space
Article_total2 <- Article_total2[!(Article_total2$body==" "), ]
## Potentially more than one space: Test the code to make sure we are not removing too much
test <- Article_total2[(Article_total2$body=="  "), ] #test two spaces
test <- Article_total2[(Article_total2$body=="   "), ] #test three spaces
test <- Article_total2[(Article_total2$body=="    "), ] #test four spaces
test <- Article_total2[(Article_total2$body=="     "), ] #test five spaces
test <- Article_total2[(Article_total2$body=="      "), ] #test six spaces
test <- Article_total2[(Article_total2$body=="       "), ] #test seven spaces
## Indeed all the rows correspond to paragraphs with 2 to 5 spaces
Article_total2 <- Article_total2[!(Article_total2$body=="  "), ]
Article_total2 <- Article_total2[!(Article_total2$body=="   "), ]
Article_total2 <- Article_total2[!(Article_total2$body=="    "), ]
Article_total2 <- Article_total2[!(Article_total2$body=="     "), ]
## Visual inspection: Still empty rows, such as line number 29
Article_total2$body[29]
## I don't really know what this is, check using the row number instead of the symbol itself
test <- Article_total2[(Article_total2$body==Article_total2$body[29]), ]
## Extract the symbol
ToRemove <- Article_total2$body[29]
## Indeed all the rows correspond to paragraphs with a space --> Removing using the symbol itself
Article_total2 <- Article_total2[!(Article_total2$body==ToRemove), ]
## Visual inspection: Still empty rows, such as line number 13974
Article_total2$body[13974] ## corresponds to nothing, but wasn't caught earlier
## Extract the symbol, test and remove
ToRemove <- Article_total2$body[13974]
test <- Article_total2[(Article_total2$body==ToRemove), ]
Article_total2 <- Article_total2[!(Article_total2$body==ToRemove), ]
## Visual inspection: Still empty rows, such as line number 8368
Article_total2$body[8368] ## I don't know what kind of space it is
## Extract the symbol, test and remove
ToRemove <- Article_total2$body[8368]
test <- Article_total2[(Article_total2$body==ToRemove), ]
Article_total2 <- Article_total2[!(Article_total2$body==ToRemove), ]
## Visual inspection: Still empty rows, such as line number 67
Article_total2$body[67] ## I don't know what kind of space it is
## Extract the symbol, test and remove
ToRemove <- Article_total2$body[67]
test <- Article_total2[(Article_total2$body==ToRemove), ]
Article_total2 <- Article_total2[!(Article_total2$body==ToRemove), ]
## Visual inspection: Still empty rows, such as line number 15279
Article_total2$body[15279] ## I don't know what kind of space it is
## Extract the symbol, test and remove
ToRemove <- Article_total2$body[15279]
test <- Article_total2[(Article_total2$body==ToRemove), ]
Article_total2 <- Article_total2[!(Article_total2$body==ToRemove), ]

Remove weird scraping instances
HTML Cleanup: The code uses grepl("<[^>]+>", ...) to identify rows that still contain HTML tags (sequences of the form "<...>").
## This corresponds to cases where the sign "<" + other signs appear; these are rows with HTML language
test <- Article_total2
test_HTML <- grepl("<[^>]+>", Article_total2$body)
test$HTML <- test_HTML
test <- test[(test$HTML=="TRUE"), ]
## Indeed corresponds to rows to remove
Article_total2$HTML <- test_HTML
Article_total2 <- Article_total2[!(Article_total2$HTML=="TRUE"), ]
Article_total2$HTML <- NULL

Action: It filters out any rows where test_HTML is TRUE, removing residual web code that wasn’t caught by the initial scraper.
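Dropping the whole row is one design choice; if some paragraphs mixed real text with a stray tag, an alternative (sketched here, not what the script does) would be to strip the tags and keep the surrounding text:

## Alternative: remove the tags themselves but keep the rest of the paragraph
Article_total2$body <- gsub("<[^>]+>", "", Article_total2$body)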
3.3 Save the data
The final section exports the cleaned dataset in two formats for future use.
Save as an Excel file
Uses write.xlsx to save Article_total2 as a standard Excel spreadsheet.
Save as an RData file
Uses save to store the dataframe as an R object, which preserves the data types (like factors vs strings) better than Excel for future R sessions.
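A minimal sketch of these two export steps (the file names below are placeholders, not necessarily those used in the original script):

## Export to Excel for inspection outside R
write.xlsx(Article_total2, file = "ArticleETToday_cleaned.xlsx")
## Save as an RData file to preserve column types for later R sessions
save(Article_total2, file = "ArticleETToday_cleaned.Rdata")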
4 Markdown document, PDF output file, RData and Excel files of the scraped data
You can find the pre-filled Markdown document of this section here. Here is the PDF output of the same document.
The RData output file can be downloaded here. The Excel file corresponding to this dataset is here.