Hello everyone. I am Li, a research student. I would like to introduce a paper I read in a recent English Literature Seminar, along with my personal thoughts.
- Paper Title: KnowEdu: A System to Construct Knowledge Graph for Education
- Year of Publication: 2018
- Authors: Penghe Chen, Yu Lu, Vincent W. Zheng, Xiyang Chen and Boda Yang
- Journal: IEEE Access
- Pages: 31553 – 31563
Knowledge graphs are information-intensive knowledge bases that integrate diverse data from different fields, and they are already widely used in general-purpose products such as Google’s Knowledge Graph and Apple’s Siri. However, such general-purpose knowledge graphs do not sufficiently meet the needs of specific domains like education. There are two main reasons: the educational field requires advanced, specialized knowledge, and conventional named entity extraction methods cannot effectively handle education-specific terminology and complex relationships.
In the educational field, knowledge graphs (also referred to as concept maps or knowledge maps) are broadly used for classroom support, learning recommendations in online courses, and visualization of concepts. In fact, knowledge graphs have been introduced even in Khan Academy, a major Massive Open Online Course (MOOC) platform. However, currently, most of these educational knowledge graphs are manually created by experienced teachers or experts. Such manual creation is extremely time-consuming and labor-intensive, making it difficult to keep up with the rapid expansion of knowledge and course scales. Furthermore, since there are often discrepancies between experts’ perceptions and students’ actual cognitive states (the so-called “expert blind spot”), manually created knowledge graphs can sometimes lead students’ learning in the wrong direction.
To solve these challenges, this study proposes “KnowEdu,” a system that automatically constructs educational knowledge graphs suitable for school education and online learning. Specifically, the KnowEdu system is composed of two primary modules: the “Instructional Concept Extraction” module and the “Educational Relation Identification” module.
The Instructional Concept Extraction module uses education-specific teaching materials, such as curriculum guidelines, textbooks, and classroom materials. These materials are typically well structured, clear in meaning, and dense in knowledge, which makes them well suited to extracting educational concepts. The main purpose of this module is to extract technical terms related to education (such as “linear equations” or “photosynthesis”) from these materials, rather than conventional named entities like names of people or places. The system first converts the teaching materials into machine-readable text using OCR or speech recognition, and then automatically extracts educational concept nodes from the text using neural sequence labeling models built on recurrent networks such as GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory). Compared to conventional Conditional Random Field (CRF) models, neural network models can learn features automatically and handle context dependencies more effectively.
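To make the sequence labeling idea concrete, here is a minimal Python sketch of the final step only: turning per-token labels into concept strings. It assumes a BIO-style tagging scheme with the hypothetical tag names `B-CON`, `I-CON`, and `O`; the summary above does not state which scheme the paper uses, and the tags here are hard-coded rather than predicted by a GRU or LSTM model.

```python
from typing import List


def extract_concepts(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect token spans labelled B-CON / I-CON into concept strings.

    The tag names are an assumption made for this illustration.
    """
    concepts: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B-CON":
            # A new concept starts; close any concept that is still open.
            if current:
                concepts.append(" ".join(current))
            current = [token]
        elif tag == "I-CON" and current:
            # Continue the concept that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-CON with no open concept) ends the span.
            if current:
                concepts.append(" ".join(current))
            current = []
    if current:
        concepts.append(" ".join(current))
    return concepts


# Hand-written labels standing in for the output of a trained tagger.
tokens = ["Solve", "the", "linear", "equations", "before", "studying", "photosynthesis"]
tags = ["O", "O", "B-CON", "I-CON", "O", "O", "B-CON"]
concepts = extract_concepts(tokens, tags)
print(concepts)
```

In a real pipeline the `tags` list would come from the trained model, and the extracted strings would become the nodes of the knowledge graph.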
The Educational Relation Identification module extracts abstract and latent relationships between educational concepts, such as “prerequisite relations,” “inclusion relations,” and “causal relations.” This study particularly focuses on prerequisite relations. Prerequisite relations are latent relationships between concepts that are grounded in cognitive processes, and they are very helpful for teachers designing lessons and for students planning learning paths. Teachers typically judge prerequisite relations based on students’ learning status. Specifically, they reason as follows:
- If a student understands Concept B, they must already understand Concept A, which is the prerequisite concept for Concept B.
- If a student does not understand Concept A, it is highly likely they will not be able to understand Concept B.
From the system’s perspective, to judge that “Concept A is prerequisite knowledge for Concept B,” two conditions must be satisfied simultaneously: “B ⇒ A” (if B is understood, A is also understood) and “¬A ⇒ ¬B” (if A is not understood, B is not understood either).
To realize this, the KnowEdu system utilizes students’ learning assessment data, such as test results, assignment submission status, and learning logs from MOOCs, and identifies prerequisite relations using probabilistic association rule mining (p-Apriori). By analyzing students’ understanding of different knowledge areas, prerequisite relations between concepts can be effectively discovered.
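As a rough illustration of how the two teacher heuristics can be checked against assessment data, the Python sketch below computes the support and confidence of the rules "B ⇒ A" and "¬A ⇒ ¬B" over a small mastery table. This is a simplified, deterministic version and not the paper's p-Apriori: it assumes each student's mastery of a concept is a known true/false value, whereas the probabilistic method works with uncertain mastery. The function names, the toy records, and the thresholds `min_support` and `min_confidence` are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, bool]  # concept name -> whether the student mastered it


def rule_stats(
    records: List[Record],
    antecedent: Callable[[Record], bool],
    consequent: Callable[[Record], bool],
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    n = len(records)
    ante = sum(1 for r in records if antecedent(r))
    both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence


def is_prerequisite(
    records: List[Record],
    a: str,
    b: str,
    min_support: float = 0.2,      # hypothetical threshold
    min_confidence: float = 0.8,   # hypothetical threshold
) -> bool:
    """Check whether concept `a` looks like a prerequisite of concept `b`."""
    # Rule 1: students who master B have also mastered A (B => A).
    s1, c1 = rule_stats(records, lambda r: r[b], lambda r: r[a])
    # Rule 2: students who have not mastered A have not mastered B (not A => not B).
    s2, c2 = rule_stats(records, lambda r: not r[a], lambda r: not r[b])
    return min(s1, s2) >= min_support and min(c1, c2) >= min_confidence


# Toy data: made up for illustration, not taken from the paper.
records = [
    {"A": True, "B": True},
    {"A": True, "B": True},
    {"A": True, "B": False},
    {"A": False, "B": False},
    {"A": False, "B": False},
]
print(is_prerequisite(records, "A", "B"))  # A as a prerequisite of B
print(is_prerequisite(records, "B", "A"))  # the reverse direction
```

Requiring both rules to pass is what gives the relation a direction: in the toy data, some students master A without mastering B, so the reverse direction fails the confidence check.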
This study verified the performance of the KnowEdu system with a case study in mathematics. For educational concept extraction, the neural network model achieved an F1 score above 0.7, significantly higher than that of the conventional CRF model. For prerequisite relation identification, using data from 4,488 seventh-grade students enrolled in 31 middle schools in Beijing, the AUC reached 0.95 and the Mean Average Precision (MAP) was 0.87, showing extremely high effectiveness.
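For readers unfamiliar with these metrics, the following Python sketch shows the standard textbook definitions of F1 and Mean Average Precision. It is only meant to clarify what the reported numbers measure; the paper's exact evaluation protocol is not described in this summary, and the inputs below are invented for illustration.

```python
from typing import List


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked_relevance: List[bool]) -> float:
    """Average of precision values taken at each relevant item in a ranking."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings: List[List[bool]]) -> float:
    """Mean of average precision over several ranked lists."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Invented example: 7 correct concepts, 3 spurious, 3 missed.
print(f1_score(tp=7, fp=3, fn=3))
# Invented example: two ranked candidate lists, True marks a correct relation.
print(mean_average_precision([[True, False, True], [False, True]]))
```

A MAP close to 1 means that correct prerequisite relations tend to appear near the top of each ranked candidate list.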
While the KnowEdu system achieved strong results in the automatic construction of educational knowledge graphs, there is still room for further expansion and improvement. The authors point out that the current system only handles educational concept and relation extraction within a single subject. In reality, concepts such as “functions” appear across multiple subjects like mathematics and physics; therefore, constructing cross-disciplinary knowledge graphs will be an important research task in the future. They also note that constructing knowledge graphs is harder for humanities subjects than for science subjects, because humanities content contains many emotional elements and ambiguous expressions.
The following are my own thoughts on this study. In this research, a neural sequence labeling model was introduced for the first time for concept extraction tasks in the educational field, and I felt the methodology had originality. Furthermore, the structure of the paper is clear, and I felt there was much to learn in terms of understanding the framework for constructing educational knowledge graphs.
On the other hand, regarding the data used in this study, the scope is limited to learning assessment data collected from online learning platforms such as MOOCs. While this is due to the characteristics of the neural network model and design constraints, I believe the lack of other data imposes certain limits on the research’s applicability and extensibility. Specifically, I think it might be possible to more accurately grasp latent relationships between concepts by utilizing more multifaceted data, such as learners’ behavioral histories, learning process logs, or interaction data within the classroom. Therefore, the analysis and use of such additional data will likely become an important research direction in constructing educational knowledge graphs.
Additionally, there were interesting points regarding the judgment method for prerequisite relations presented in the paper. In this study, the system automatically extracts prerequisite relations by referring to the way teachers empirically judge them. This approach suggests that by appropriately incorporating teachers’ practical insights from educational settings, it might be possible to go beyond simple automation and alleviate the “black box problem” characteristic of neural networks. That is, for the problem where it is difficult to understand how the model extracted relationships between concepts, appropriately incorporating teachers’ empirical judgments would be helpful in improving the model’s explainability and reliability.
Furthermore, I agree with the point made in the paper that constructing knowledge graphs is harder for humanities subjects than for science subjects. Since the relation extraction method used in this paper is strongly rule-based, it is well suited to subjects with clear logic and simple structures, such as those in the sciences. Conversely, accuracy can be expected to decrease when the method is applied to humanities fields that include many ambiguous and emotional expressions. While recent models can process vast amounts of text with high precision thanks to developments in natural language processing, such complex models also have a disadvantage: because of their black-box nature, the validity and interpretability of their outputs tend to suffer. Therefore, especially when targeting humanities subjects, I believe there is room to improve final results by more actively incorporating specific and practical judgments and insights from the educational front.
Overall, the KnowEdu system in this study shows excellent results in automatically constructing educational knowledge graphs, and I felt that the effective utilization of educational data and coordination with educational settings will be key to further deepening research.




