Middle Data Format Protocol

In EduStudio, we adopt a flexible CSV (Comma-Separated Values) file format following Recbole. The flexible CSV format is defined in middata stage of dataset (see dataset stage protocol for details).

The Middle Data Format Protocol including two parts: Columns name Format and Filename Format.

Columns Name Format

feat_type	Explanations	Examples
token	single discrete feature	exer_id, stu_id
token_seq	discrete features sequence	knowledge concept seq of exercise
float	single continuous feature	label, start_timestamp
float_seq	continuous feature sequence	word2vec embedding of exercise

Filename format

So far, there are five atomic files in edustudio.

Note: Users could also load other types of data except the three atomic files below. {dt} is the dataset name.

filename format	description
{dt}.inter.csv	Student-Exercise Interaction data
{dt}.train.inter.csv	Student-Exercise Interaction data for training set
{dt}.train.inter.csv	Student-Exercise Interaction data for validation set
{dt}.train.inter.csv	Student-Exercise Interaction data for test set
{dt}.stu.csv	Features of students
{dt}.exer.csv	Features of exercises

Example

example_dt.inter.csv

stu_id:token	exer_id:token	label:float
0	1	0.0
1	0	1.0

example_dt.stu.csv

stu_id:token	gender:token	occupation:token
0	1	11
1	0	7

example_dt.exer.csv

exer_id:token	cpt_seq:token_seq	w2v_emb:float_seq
0	[0, 1]	[0.121, 0.123, 0.761]
1	[1, 2, 3]	[0.229, -0.113, 0.138]