Middle Data Format Protocol
In EduStudio, we adopt a flexible CSV (Comma-Separated Values) file format, following RecBole. The format is defined in the middata stage of a dataset (see the dataset stage protocol for details).
The Middle Data Format Protocol consists of two parts: the Columns Name Format and the Filename Format.
Columns Name Format
feat_type | Explanations | Examples |
---|---|---|
token | single discrete feature | exer_id, stu_id |
token_seq | discrete feature sequence | knowledge concept seq of exercise |
float | single continuous feature | label, start_timestamp |
float_seq | continuous feature sequence | word2vec embedding of exercise |
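
The examples at the end of this section write each column header as `name:feat_type`, for instance `stu_id:token`. A minimal sketch of how such a header could be split and validated is shown below. The function name `parse_column_header` is hypothetical and is not part of the EduStudio API; only the four feat_type values come from the table above.

```python
# The four feat_type values listed in the table above.
FEAT_TYPES = {"token", "token_seq", "float", "float_seq"}


def parse_column_header(header: str) -> tuple:
    """Split a header such as 'stu_id:token' into (name, feat_type).

    Hypothetical helper for illustration; not an EduStudio function.
    """
    # rpartition splits on the last colon, so the type is always the tail.
    name, sep, feat_type = header.strip().rpartition(":")
    if not sep or not name or feat_type not in FEAT_TYPES:
        raise ValueError(f"invalid column header: {header!r}")
    return name, feat_type


parsed = parse_column_header("cpt_seq:token_seq")
```

A header without a type suffix, or with a type outside the four listed values, raises `ValueError` in this sketch.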
Filename Format
So far, EduStudio defines the atomic files listed below.
Note: Users can also load other types of data in addition to the atomic files listed below. {dt} is the dataset name.
filename format | description |
---|---|
{dt}.inter.csv | Student-Exercise Interaction data |
{dt}.train.inter.csv | Student-Exercise Interaction data for training set |
{dt}.valid.inter.csv | Student-Exercise Interaction data for validation set |
{dt}.test.inter.csv | Student-Exercise Interaction data for test set |
{dt}.stu.csv | Features of students |
{dt}.exer.csv | Features of exercises |
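
The naming pattern above can be sketched as a small helper that builds a filename from the dataset name `{dt}`. The function `atomic_filename` and its `split` parameter are hypothetical names introduced for illustration and are not part of EduStudio; the sketch also assumes that only interaction files carry a split prefix such as `train`.

```python
from typing import Optional

# Base atomic file kinds taken from the table above.
BASE_KINDS = ("inter", "stu", "exer")


def atomic_filename(dt: str, kind: str, split: Optional[str] = None) -> str:
    """Build a middata filename such as 'example_dt.train.inter.csv'.

    Hypothetical helper for illustration; not an EduStudio function.
    """
    if kind not in BASE_KINDS:
        raise ValueError(f"unknown atomic file kind: {kind!r}")
    # Assumption: only interaction data is divided into splits.
    if split is not None and kind != "inter":
        raise ValueError("a split prefix only applies to 'inter' files")
    parts = [dt] if split is None else [dt, split]
    return ".".join(parts + [kind, "csv"])


full_name = atomic_filename("example_dt", "inter")
train_name = atomic_filename("example_dt", "inter", split="train")
```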
Example
example_dt.inter.csv

stu_id:token | exer_id:token | label:float |
---|---|---|
0 | 1 | 0.0 |
1 | 0 | 1.0 |

example_dt.stu.csv

stu_id:token | gender:token | occupation:token |
---|---|---|
0 | 1 | 11 |
1 | 0 | 7 |

example_dt.exer.csv

exer_id:token | cpt_seq:token_seq | w2v_emb:float_seq |
---|---|---|
0 | [0, 1] | [0.121, 0.123, 0.761] |
1 | [1, 2, 3] | [0.229, -0.113, 0.138] |
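
To make the example concrete, the sketch below reads the content of example_dt.exer.csv with Python's standard `csv` module and converts each cell according to the feat_type in its header. It is an illustration under stated assumptions, not EduStudio's own loader: it assumes that sequence cells are stored on disk as quoted, bracketed lists (so the commas inside them do not split the field) and that token values are integer ids, as in the tables above. The name `read_atomic_csv` is hypothetical.

```python
import csv
import io
import json

# One converter per feat_type.
# Assumptions: tokens are integer ids; sequences are bracketed lists
# that json.loads can parse, e.g. "[0, 1]".
CONVERTERS = {
    "token": int,
    "token_seq": json.loads,
    "float": float,
    "float_seq": json.loads,
}


def read_atomic_csv(text: str) -> list:
    """Parse atomic-file text into a list of dicts keyed by column name."""
    reader = csv.reader(io.StringIO(text))
    # Each header cell looks like 'name:feat_type'; split on the last colon.
    columns = [cell.rsplit(":", 1) for cell in next(reader)]
    rows = []
    for raw in reader:
        rows.append(
            {name: CONVERTERS[ft](cell) for (name, ft), cell in zip(columns, raw)}
        )
    return rows


# Assumed on-disk form of example_dt.exer.csv (sequence cells quoted).
EXAMPLE_EXER = '''exer_id:token,cpt_seq:token_seq,w2v_emb:float_seq
0,"[0, 1]","[0.121, 0.123, 0.761]"
1,"[1, 2, 3]","[0.229, -0.113, 0.138]"
'''

rows = read_atomic_csv(EXAMPLE_EXER)
```

After this runs, `rows[0]["cpt_seq"]` is the list `[0, 1]` and `rows[1]["exer_id"]` is the integer `1`.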