Week 12: Data analysis


Objectives

This week, you will learn how to analyze your data quickly and automatically with R.

!! This is an overview of the analysis process, without reference to statistical modeling!

Note

The data analysis is exemplified with the ETtoday dataset that we cleaned last week. If needed, you can download the R data using this link: here

1 How to analyze corpus data?

Let’s recap first what we learned and did in the past weeks:

  • We learned what a corpus linguistics study is, the questions we can ask, and the types of corpora we can/should select to conduct the study;

  • We learned how to automatically scrape data from the Internet, with examples from different sources;

  • Last week, we learned how to “clean” the raw data and prepare it for further analyses.

And this is the main objective for this week: How to analyze the data that we prepared?

Before getting into the details, we need to keep in mind the overall workflow of a corpus study.

Note

There are many, many analysis methods. Their common point is that they involve automatic processing, which is very helpful when we are dealing with large datasets. Some analysis methods are quite straightforward and traditional (e.g., frequency tables), while others are more on the computational side. Always remember to select the analysis method that is appropriate for your research question!

For this class, two methods will be presented: Word lists and KWIC (Key Word In Context). These are classic yet very useful methods, and easy to understand and implement for any kind of research.

1.1 KWIC (Key Word In Context)

The first type of analysis to introduce is called Key Word In Context, more often found under the acronym kwic. The idea is that we can better understand the meaning of a word based on how it is used in a sentence. KWIC analyses are therefore very suitable when we target a specific word or syntactic construction.

KWIC analyses can be used for:

  • Semantic analyses: The idea that the meaning of a specific word can be retrieved from its use in a sentence, based on the meaning of its neighbors.

  • Morphosyntactic analyses: Idea that the morphosyntactic context where a word or a construction is used is helpful in understanding its specific morphological and syntactic features, as well as the meaning it conveys.

Here is an example with the Mandarin word keai ‘cute’ (note that these are made-up data, just for illustration):

Sentence index Before keyword Keyword After keyword
Sentence 1 hen (very) keai (cute) de mao (DE cat)
Sentence 2 feichang (very, extremely) keai (cute) de gou (DE dog)

Based on these data, and assuming that these are the most frequent instances that we found in the corpus, we can infer that:

  • Semantically, the word keai is most often used to describe animals or pets, based on the following segment;

  • Morphosyntactically, the word keai behaves as an adjective (or stative verb, depending on the analysis), based on the preceding segment.
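
As a preview of Section 2, here is a minimal sketch of how a table like the one above could be produced with the quanteda package; the two sentences are made up, and the exact segmentation depends on quanteda’s built-in tokenizer.

library(quanteda)

## Two made-up example sentences (hen keai de mao / feichang keai de gou)
toy_sentences <- c("很可愛的貓", "非常可愛的狗")

## Tokenize, then extract the keyword with its left and right context
toy_tokens <- tokens(toy_sentences)
kwic(toy_tokens, pattern = "可愛", window = 3)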

1.2 Word lists based on frequency

Another extremely common way to analyze corpus data is to count the frequency of each word, in order to get an idea of the most frequently used ones. It is easy to understand: we just need to count how many times each word occurs in our dataset. In the end, we obtain a list like the one below (again, these are made-up data):

Word Frequency
keai (cute) 368
kuaile (happy) 354

The reality is that it is a little bit more complex than it seems, and we need to keep several remarks in mind:

  • Without any further data handling, it is more than likely that the most frequent words are (a) punctuation marks and (b) grammatical markers (the so-called ‘closed-class’ words), since they form a small, closed set and appear in virtually every sentence. The bad news is that you need further steps to obtain the table you wish for. The good news is that you can use this piece of information as a sanity check: if you compute the frequency table and grammatical words are not the most frequent, then something went wrong!

  • Defining what a “word” is is not easy. In English, the simplest approach is to say that words are separated by spaces (even though this overly simple definition is misleading). In Mandarin, there are no spaces between words… People have created packages with dictionaries listing words, so that we can still cut sentences into words, but be aware that less common or newly created words will not be detected! If your research question is really about new words, you may consider adding them to the dictionary beforehand.

  • Frequency tables can be further annotated: you can add the rank of each word, its frequency as a percentage in addition to the raw count, etc.
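
As a minimal sketch of this counting step (using the quanteda functions presented in Section 2), here is how a small frequency table can be computed; the sentences are made up, and remove_punct = TRUE illustrates the punctuation remark above.

library(quanteda)
library(quanteda.textstats)

## Made-up mini-corpus
toy_tokens <- tokens(c("很可愛的貓!", "非常可愛的狗。", "很快樂的人。"),
                     remove_punct = TRUE)   # drop punctuation marks (see remark above)
toy_dfm <- dfm(toy_tokens)                  # document-feature matrix
textstat_frequency(toy_dfm)                 # word list sorted by raw frequency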

1.3 Combining KWIC and frequency-based word lists

Every kind of analysis has pros and cons, and we cannot say that one is better than another. Again, some are simply better suited than others to your data and your research question. This even means that you can combine two types of analyses to gain more insight!

For example, you can first run a KWIC analysis and obtain a table like the one above. Then, in a second step, you can create a frequency table of the first word following (or preceding) the keyword. This gives a “KWIC + word list” analysis, sketched below!
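
As a rough sketch of this two-step idea (the full, commented version appears in Section 2.3): assume kwic_result is a KWIC output converted to a data frame; we then count the first word of the right-hand context. The object name kwic_result is just a placeholder here.

## Sketch: frequency of the first word following the keyword
## ('kwic_result' is assumed to be a kwic() output converted with as.data.frame())
library(stringr)
first_after <- word(kwic_result$post, 1)                 # first word of the right-hand context
head(sort(table(first_after), decreasing = TRUE), 10)    # quick frequency count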

2 Let’s do the job with R

Now we have a better idea of how to analyze the data conceptually. But how do we do it technically? That is what this section is about.

First, you can download the script here, as well as the dataset we will work with here.

You can follow the steps below to understand how it works. But before explaining this script, more remarks are needed.

2.1 The quanteda package to conduct corpus analyses

We are very lucky that many smart and generous people around the world have created R packages especially for dealing with corpus data, and such packages are still being updated as I write this section.

We are going to use the package called “quanteda”. You can find more information by clicking on this link.

We will also use another package developed by the same team, called “quanteda.textstats”.

install.packages("quanteda")
install.packages("quanteda.textstats")
Note

There exist several packages for segmenting Mandarin sentences into words. Here, we will use the built-in functions of the “quanteda” package. If you browse the Internet, you will notice that some people prefer the “jiebaR” package. The problem is that this package is not available on CRAN anymore, and it can be quite tricky to install on your computer. So for this week, we keep it simple!
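
For instance, quanteda’s built-in tokenizer (which relies on ICU word boundaries via the stringi package) can already segment a Mandarin sentence without any extra package; the sentence below is only an illustration, and the exact cut points may vary slightly across systems.

library(quanteda)

## The built-in tokenizer segments Mandarin sentences using ICU word boundaries
tokens("今天的天氣非常好")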

2.2 Workflow for the analysis of corpus data in R

Here is an overview of the workflow. First, we start with the clean corpus, and then we create a new dataset where the sentences are cut into words. Based on this new dataset, we can perform a KWIC analysis, create a word list, or combine the two types of analyses. Finally, we clean a little bit (as in the example below; punctuation marks, digits, selecting only the sentence/phrase, etc.), we add back the information from the original corpus, and we are done!

2.3 Explanation of the R script

2.3.1 Prepare the environment

2.3.1.1 Load the libraries

First, we need to load the necessary packages for our analysis.

library(quanteda)
Package version: 4.3.1
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
library(quanteda.textstats)
library(tidytext)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(stringr)
library(openxlsx)
#Sys.setlocale(category = "LC_ALL", locale = "cht")
  • library(quanteda): Loads the core package we are using for text analysis, which allows us to create tokens and document-feature matrices.

  • library(quanteda.textstats): A companion package to quanteda that provides statistical functions, such as calculating word frequencies.

  • library(tidytext): Useful for converting text data into “tidy” formats if we need to switch between quanteda and tidyverse workflows.

  • library(dplyr): Essential for data manipulation (like joining tables or filtering data).

  • library(stringr): Provides easy-to-use functions for string manipulation and Regular Expressions.

  • library(openxlsx): Used at the end of the script to export our results into Excel files.

  • Sys.setlocale(…): This line is commented out (#), but it is there in case you run into encoding issues on Windows. It sets the system locale to Traditional Chinese.

2.3.1.2 Load the previously cleaned data

load(file = "ArticleETToday_CorpusCourse_CLEAN.Rdata")

load(…): We load the .Rdata file containing the cleaned ETToday corpus we prepared in previous weeks. This brings the Article_total2 object into our environment.

2.3.2 Key Word In Context (KWIC)

2.3.2.1 Prepare the dataset for the analyses

Before we can analyze the text, we need to ensure our documents have IDs and are properly tokenized.

Article_total2$docname <- paste0("text", 1:nrow(Article_total2))

Article_tokens <- tokens(Article_total2$body)
  • Article_total2$docname <- …: We create a new column called docname. We use paste0 to generate a unique ID for each article (e.g., “text1”, “text2”, etc.) based on the row number (nrow).

  • Article_tokens <- tokens(…): We use the quanteda function tokens() to break the text in the body column into individual words (tokens). This creates a specialized tokens object required for the next steps.

2.3.2.2 Perform the KWIC segmentation

2.3.2.2.1 KWIC segmentation

Now we search for a specific keyword to see how it is used in context.

kwic_data <- kwic(Article_tokens, pattern = "有", window = 30)
  • kwic(…): This function (“Key Word In Context”) searches our tokenized text.

    • pattern = “有”: We are searching for the character “有” (to have/there is).

    • window = 30: We tell R to capture 30 tokens to the left (pre) and 30 tokens to the right (post) of our keyword.

2.3.2.2.2 Annotate the KWIC dataset

The kwic function gives us the context, but we lose the original metadata (like the article date or category). We need to put it back.

kwic_data <- as.data.frame(kwic_data)

kwic_data <- right_join(kwic_data, Article_total2, by = "docname")

kwic_data <- na.omit(kwic_data)
  • as.data.frame(kwic_data): The output of kwic is a special object; we convert it into a standard data frame so we can manipulate it easily.

  • right_join(…): We merge our KWIC results with the original Article_total2 dataframe. We match them using the docname column we created earlier.

  • na.omit(kwic_data): We remove any rows that have missing values (NAs) to ensure our dataset is clean for analysis.

2.2.3 (Optional) Clean the context to keep only the phrase where the keyword is found

The window of 30 tokens might include parts of previous or subsequent sentences. We want to “trim” the context down to the sentence containing the keyword.

## Keep original information just in case
kwic_data$pre_original <- kwic_data$pre
kwic_data$post_original <- kwic_data$post

## Post context
symbol1 <- "\\。" 
kwic_data$post <- sub(paste0("(", symbol1, ").*"), "\\1", kwic_data$post)

symbol2 <- "\\," 
kwic_data$post <- sub(paste0("(", symbol2, ").*"), "\\1", kwic_data$post)

symbol3 <- "\\?" 
kwic_data$post <- sub(paste0("(", symbol3, ").*"), "\\1", kwic_data$post)

symbol4 <- "\\!" 
kwic_data$post <- sub(paste0("(", symbol4, ").*"), "\\1", kwic_data$post)

## Pre context: keep only the text after the last boundary mark
kwic_data$pre <- sub(".*。([^*。]*)$", "。\\1", kwic_data$pre)
kwic_data$pre <- sub(".*,([^*,]*)$", ",\\1", kwic_data$pre)
kwic_data$pre <- sub(".*\\?([^*?]*)$", "?\\1", kwic_data$pre) # the literal ? must be escaped
kwic_data$pre <- sub(".*!([^*!]*)$", "!\\1", kwic_data$pre)
  • kwic_data$pre_original <- …: We back up the original context columns before modifying them.

  • symbol1 <- "\\。": We define the punctuation mark we want to stop at (the Chinese period). The double backslash escapes the character for the Regex.

  • sub(paste0("(", symbol1, ").*"), "\\1", …): This Regular Expression looks for the first period in the post context and deletes everything after it, effectively cutting the text off at the end of the sentence.

  • sub(".*。([^*。]*)$", "。\\1", …): This mirrors the operation for the pre context. It looks for the last period occurring before our keyword and deletes everything before it, so the context starts at the beginning of the current sentence.

Note: The code repeats this process for commas (,), question marks (?), and exclamation marks (!) to handle different sentence boundaries.
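
As a side note on the design, the four pairs of sub() calls above could also be collapsed by putting the four boundary marks into one character class. This is just a more compact, equivalent alternative, shown here as a sketch; it starts from the backed-up *_original columns created earlier.

## Compact alternative: one character class for all four boundary marks
boundary <- "[。,?!]"
kwic_data$post <- sub(paste0("(", boundary, ").*"), "\\1", kwic_data$post_original)
kwic_data$pre  <- sub(paste0(".*(", boundary, ")([^。,?!]*)$"), "\\1\\2", kwic_data$pre_original)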

## Have a look at the data (I delete some columns so that it is easier to display on the website)
kwic_data_for_website <- kwic_data
kwic_data_for_website$original_article <- NULL
kwic_data_for_website$body <- NULL
kwic_data_for_website$pre_original <- NULL
kwic_data_for_website$post_original <- NULL
knitr::kable(head(kwic_data_for_website))
docname from to pre keyword post pattern time class title url year month day
text1 19 19 , 竟然 玩家 突發 奇想 將 兩 款 遊戲 尬 在一起 , 2024年01月01日 10:57 政治 神人把《我的世界》改成《血源詛咒》 還原度超高玩家狂敲碗:快點出 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
text6 55 55 , 但 已經 不少 玩家 表示 相當 期待 , 2024年01月01日 10:57 政治 神人把《我的世界》改成《血源詛咒》 還原度超高玩家狂敲碗:快點出 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
text12 18 18 , 常見 補品 燒酒雞 、 薑 母鴨 、 羊肉爐 、 藥 燉 排骨 等 , 2024年01月01日 09:07 社會 跨年冬令進補爐火需留意 安裝住宅用火災警報器避免悲劇 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
text14 57 57 , 選擇 熄火 安全 裝置 及 溫度 感知 功能 爐 具 , 2024年01月01日 09:07 社會 跨年冬令進補爐火需留意 安裝住宅用火災警報器避免悲劇 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
text17 30 30 , 發現 29 歲 林 姓 男子 涉 重 嫌 , 2024年01月01日 10:29 社會 半工半讀買的機車被偷!23歲女人生第一輛 警埋伏10hrs抓賊 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
text19 46 46 , 網友 始 知 台灣 改 車 界 這 號 人物 存在 。 2024年01月01日 13:00 社會 揭密廖老大打龜號進化史!他棄台積電工師 兩岸改裝達人之爭曝 https://www.ettoday.net/news/20231231/2652951.htm 2024 01 01
2.3.2.2.3 Combined analysis: Frequency table of the first word following you 有 ‘to have’

We can now analyze what words typically follow “有”.

## Extract the first word
kwic_data$post_first_word <- word(kwic_data$post, 1)

## We need to transform the tokenized data into a 'dfm' dataset
kwic_data_freq <- dfm( tokens(kwic_data$post_first_word, remove_punct = TRUE) )

kwic_data_freq <- textstat_frequency(kwic_data_freq)

## Clean a little bit
kwic_data_freq <- kwic_data_freq[-grep("[[:digit:]]", kwic_data_freq$feature),]

## Recreate the rank
kwic_data_freq$rank <- 1:length(kwic_data_freq$rank)

knitr::kable(head(kwic_data_freq, 100))
feature frequency rank docfreq group
1 民眾 1291 1 1291 all
3 694 2 694 all
4 網友 656 3 656 all
5 很多 608 4 608 all
6 600 5 600 all
7 問題 588 6 588 all
8 什麼 580 7 580 all
10 媒體 550 8 550 all
11 533 9 533 all
12 一名 503 10 503 all
14 任何 476 11 476 all
15 逃亡 445 12 445 all
16 可能 437 13 437 all
17 許多 399 14 399 all
18 機會 395 15 395 all
19 一個 338 16 338 all
21 信心 298 17 298 all
22 其他 298 18 298 all
23 271 19 271 all
24 更多 264 20 264 all
25 一些 262 21 262 all
26 256 22 256 all
27 相當 249 23 249 all
28 不少 234 24 234 all
29 需要 228 25 228 all
30 223 26 223 all
31 這樣 222 27 222 all
32 222 28 222 all
33 必要 220 29 220 all
35 207 30 207 all
36 不同 200 31 200 all
37 毒品 198 32 198 all
38 能力 196 33 196 all
39 多少 194 34 194 all
40 191 35 191 all
43 這麼 181 36 181 all
44 非常 181 37 181 all
45 部分 177 38 177 all
46 羈押 176 39 176 all
48 相關 172 40 172 all
49 一定 169 41 169 all
50 明顯 162 42 162 all
51 共識 156 43 156 all
53 153 44 153 all
54 超過 152 45 152 all
55 151 46 151 all
56 責任 149 47 149 all
57 重大 147 48 147 all
58 很大 147 49 147 all
59 違反 140 50 140 all
60 139 51 139 all
61 135 52 135 all
62 看到 135 53 135 all
63 疑慮 134 54 134 all
64 意願 132 55 132 all
65 意見 130 56 130 all
66 128 57 128 all
67 興趣 126 58 126 all
68 爭議 124 59 124 all
69 自己 124 60 124 all
70 發生 122 61 122 all
71 120 62 120 all
72 大量 115 63 115 all
73 糾紛 111 64 111 all
74 幫助 111 65 111 all
75 一位 111 66 111 all
76 110 67 110 all
77 108 68 108 all
78 疏失 106 69 106 all
79 條件 106 70 106 all
80 債務 106 71 106 all
81 事實 104 72 104 all
82 103 73 103 all
83 兩個 102 74 102 all
84 高度 101 75 101 all
85 異狀 100 76 100 all
86 違法 99 77 99 all
87 96 78 96 all
88 多次 94 79 94 all
89 政治 93 80 93 all
90 93 81 93 all
92 91 82 91 all
94 90 83 90 all
95 高達 89 84 89 all
96 男子 89 85 89 all
97 諸多 88 86 88 all
98 異常 86 87 86 all
99 瑕疵 86 88 86 all
100 86 89 86 all
101 過失 86 90 86 all
102 哪些 85 91 85 all
103 85 92 85 all
104 幾個 84 93 84 all
105 可能是 84 94 84 all
106 83 95 83 all
107 車輛 82 96 82 all
109 81 97 81 all
110 一輛 81 98 81 all
111 這種 80 99 80 all
112 78 100 78 all
  • word(kwic_data$post, 1): Uses stringr to extract specifically the first word from the post (context after) column.

  • tokens(…): We tokenize this list of “first words”.

  • dfm(…): We convert those tokens into a Document-Feature Matrix.

  • textstat_frequency(…): We calculate how often each word appears.

  • grep(“[[:digit:]]”, …): We use grep to find any words that contain numbers (digits) and remove them (using the minus sign -) to clean up our results.

  • 1:length(…): Since we removed some rows, we reset the rank column so it goes from 1 to N sequentially.

2.3.2.3 Save the data

Finally, we save our hard work.

write.xlsx(kwic_data, "ArticleETToday_KWIC_You.xlsx")
save(kwic_data, file = "ArticleETToday_KWIC_You.Rdata")
  • write.xlsx: Exports the dataframe to an Excel file for manual inspection.

  • save: Saves the R object to an .Rdata file so we can load it quickly in future R sessions.

2.3.3 Frequency tables

2.3.3.1 Create the overall frequency table

2.3.3.1.1 Creation of the first table

Now, let’s look at the frequency of words across the entire corpus, not just around a keyword.

Article_tokens_frequency <- dfm(
  tokens(Article_total2$body,
         remove_punct = TRUE)
  )
Article_tokens_frequency <- textstat_frequency(Article_tokens_frequency)

table_AllWordsFreq_Top100 <- head(Article_tokens_frequency, 100) 
knitr::kable(table_AllWordsFreq_Top100)
feature frequency rank docfreq group
357709 1 159298 all
116544 2 88581 all
84106 3 66187 all
71392 4 57501 all
69084 5 49761 all
67573 6 53558 all
65657 7 53298 all
63140 8 32390 all
61733 9 49197 all
59999 10 48402 all
59965 11 51054 all
49873 12 38094 all
表示 49840 13 47384 all
48807 14 33965 all
47947 15 43097 all
47910 16 34208 all
45675 17 39070 all
44905 18 36914 all
44599 19 36588 all
44505 20 38318 all
2 43707 21 34342 all
43394 22 32669 all
42361 23 28582 all
41664 24 35874 all
41052 25 26393 all
民眾 40981 26 29991 all
39541 27 31461 all
38658 28 34703 all
台灣 38028 29 24641 all
36770 30 29132 all
1 36107 31 28584 all
35360 32 30056 all
警方 35205 33 26808 all
34824 34 30380 all
3 34319 35 28447 all
33596 36 29140 all
33164 37 18553 all
32255 38 21397 all
32162 39 28049 all
31311 40 20937 all
31281 41 27774 all
30456 42 26271 all
民進黨 29648 43 20912 all
29633 44 25843 all
29426 45 19296 all
29209 46 18968 all
29041 47 16593 all
28849 48 18869 all
27927 49 19862 all
27838 50 24172 all
27598 51 22325 all
27440 52 23486 all
27073 53 25039 all
26964 54 17524 all
26866 55 22316 all
發生 26700 56 23129 all
25947 57 21844 all
25547 58 22155 all
25397 59 21497 all
國民黨 25126 60 17804 all
24367 61 18605 all
發現 23872 62 20970 all
23781 63 16125 all
立委 23697 64 17483 all
23330 65 21330 all
23243 66 21044 all
調查 23167 67 19220 all
23092 68 20346 all
指出 23045 69 22728 all
4 22781 70 19855 all
22453 71 14297 all
自己 22438 72 19022 all
沒有 21977 73 19189 all
21507 74 17793 all
5 21402 75 18870 all
21213 76 17565 all
男子 20713 77 15449 all
20494 78 13075 all
20250 79 18416 all
20149 80 17965 all
20140 81 18193 all
19999 82 13527 all
萬元 19934 83 14724 all
19778 84 17403 all
19477 85 16186 all
19266 86 12596 all
19204 87 18762 all
19140 88 14233 all
19098 89 16180 all
18916 90 12213 all
人員 18698 91 14908 all
18625 92 17039 all
18586 93 15277 all
10 18579 94 16677 all
18579 94 15239 all
18303 96 14481 all
6 18281 97 15998 all
政府 18281 97 14640 all
17969 99 13651 all
相關 17961 100 16032 all
  • tokens(Article_total2$body, …): We tokenize the full body text of all articles, removing punctuation.

  • dfm(…): We turn that huge list of tokens into a Document-Feature Matrix.

  • textstat_frequency(…): We calculate the frequency of every unique word in the corpus.

  • head(…, 100): We create a smaller table containing only the top 100 most frequent words.

2.3.3.1.2 Clean it up a little bit

We often find “noise” in the data, like numbers, which we want to filter out.

## Example with numbers
table_FreqWord <- Article_tokens_frequency[-grep("[[:digit:]]", Article_tokens_frequency$feature),]

## Redo the ranking
table_FreqWord$rank <- 1:length(table_FreqWord$rank)
  • grep(“[[:digit:]]”, …): Similar to before, we search for any features (words) containing numbers and remove them from the list.

  • 1:length(…): We re-calculate the rank column to fill in the gaps left by the removed words.

2.3.3.1.3 Final table, addition of the percentage

Frequencies are good, but percentages help us understand the relative importance of a word.

table_FreqWord_Top100 <- head(table_FreqWord, 100)

table_FreqWord_Top100$percentage <- round(table_FreqWord_Top100$frequency/sum(table_FreqWord$frequency)*100, 5)
knitr::kable(table_FreqWord_Top100)
feature frequency rank docfreq group percentage
1 357709 1 159298 all 2.37874
2 116544 2 88581 all 0.77501
3 84106 3 66187 all 0.55930
4 71392 4 57501 all 0.47475
5 69084 5 49761 all 0.45940
6 67573 6 53558 all 0.44936
7 65657 7 53298 all 0.43661
8 63140 8 32390 all 0.41988
9 61733 9 49197 all 0.41052
10 59999 10 48402 all 0.39899
11 59965 11 51054 all 0.39876
12 49873 12 38094 all 0.33165
13 表示 49840 13 47384 all 0.33143
14 48807 14 33965 all 0.32456
15 47947 15 43097 all 0.31884
16 47910 16 34208 all 0.31860
17 45675 17 39070 all 0.30374
18 44905 18 36914 all 0.29862
19 44599 19 36588 all 0.29658
20 44505 20 38318 all 0.29596
22 43394 21 32669 all 0.28857
23 42361 22 28582 all 0.28170
24 41664 23 35874 all 0.27706
25 41052 24 26393 all 0.27299
26 民眾 40981 25 29991 all 0.27252
27 39541 26 31461 all 0.26295
28 38658 27 34703 all 0.25707
29 台灣 38028 28 24641 all 0.25288
30 36770 29 29132 all 0.24452
32 35360 30 30056 all 0.23514
33 警方 35205 31 26808 all 0.23411
34 34824 32 30380 all 0.23158
36 33596 33 29140 all 0.22341
37 33164 34 18553 all 0.22054
38 32255 35 21397 all 0.21449
39 32162 36 28049 all 0.21388
40 31311 37 20937 all 0.20822
41 31281 38 27774 all 0.20802
42 30456 39 26271 all 0.20253
43 民進黨 29648 40 20912 all 0.19716
44 29633 41 25843 all 0.19706
45 29426 42 19296 all 0.19568
46 29209 43 18968 all 0.19424
47 29041 44 16593 all 0.19312
48 28849 45 18869 all 0.19184
49 27927 46 19862 all 0.18571
50 27838 47 24172 all 0.18512
51 27598 48 22325 all 0.18352
52 27440 49 23486 all 0.18247
53 27073 50 25039 all 0.18003
54 26964 51 17524 all 0.17931
55 26866 52 22316 all 0.17866
56 發生 26700 53 23129 all 0.17755
57 25947 54 21844 all 0.17255
58 25547 55 22155 all 0.16989
59 25397 56 21497 all 0.16889
60 國民黨 25126 57 17804 all 0.16709
61 24367 58 18605 all 0.16204
62 發現 23872 59 20970 all 0.15875
63 23781 60 16125 all 0.15814
64 立委 23697 61 17483 all 0.15758
65 23330 62 21330 all 0.15514
66 23243 63 21044 all 0.15456
67 調查 23167 64 19220 all 0.15406
68 23092 65 20346 all 0.15356
69 指出 23045 66 22728 all 0.15325
71 22453 67 14297 all 0.14931
72 自己 22438 68 19022 all 0.14921
73 沒有 21977 69 19189 all 0.14615
74 21507 70 17793 all 0.14302
76 21213 71 17565 all 0.14107
77 男子 20713 72 15449 all 0.13774
78 20494 73 13075 all 0.13628
79 20250 74 18416 all 0.13466
80 20149 75 17965 all 0.13399
81 20140 76 18193 all 0.13393
82 19999 77 13527 all 0.13299
83 萬元 19934 78 14724 all 0.13256
84 19778 79 17403 all 0.13152
85 19477 80 16186 all 0.12952
86 19266 81 12596 all 0.12812
87 19204 82 18762 all 0.12771
88 19140 83 14233 all 0.12728
89 19098 84 16180 all 0.12700
90 18916 85 12213 all 0.12579
91 人員 18698 86 14908 all 0.12434
92 18625 87 17039 all 0.12386
93 18586 88 15277 all 0.12360
95 18579 89 15239 all 0.12355
96 18303 90 14481 all 0.12171
98 政府 18281 91 14640 all 0.12157
99 17969 92 13651 all 0.11949
100 相關 17961 93 16032 all 0.11944
101 17914 94 13623 all 0.11913
102 17898 95 15036 all 0.11902
103 17804 96 15484 all 0.11840
104 總統 17774 97 13169 all 0.11820
105 17680 98 12860 all 0.11757
107 17594 99 11644 all 0.11700
108 進行 17570 100 15739 all 0.11684
  • head(…, 100): We isolate the top 100 words again after our cleaning process.
  • table_FreqWord_Top100$frequency/sum(table_FreqWord$frequency): We divide the frequency of a specific word by the total frequency of all words in the corpus.
  • *100: Convert the decimal to a percentage.
  • round(…, 5): Round the result to 5 decimal places for readability.

2.3.3.2 Save the data

write.xlsx(table_FreqWord_Top100, "ArticleETToday_Top100nouns.xlsx")
save(table_FreqWord_Top100, file = "ArticleETToday_Top100nouns.Rdata")
  • write.xlsx: Saves the top 100 words table to Excel.

  • save: Saves the R object for later use.

3 Markdown document, PDF output file, RData and Excel files of the scraped data

You can find the pre-filled Markdown document of this section here. Here is the PDF output of the same document.

The RData output file can be downloaded here for the KWIC analysis, and here for the frequency analysis. The corresponding Excel files are here (KWIC analysis) and here (frequency analysis).