Dataset Status Protocol
In EduStudio, we view a dataset as being in one of three statuses: rawdata, middata, and cachedata.
inconsistent rawdata: the original data format provided by the dataset publisher.
standardized middata: the standardized middle data format (see Middle Data Format Protocol) defined by EduStudio.
model-friendly cachedata: the data format that is convenient for model usage.
Dataset Folder Format Example
All datasets must be stored in a unified folder. The example below illustrates the folder format of the FrcSub dataset.
data/
├── FrcSub
│   ├── cachedata
│   │   ├── FrcSub_five_fold
│   │   │   ├── datatpl_cfg.json
│   │   │   ├── df_exer.pkl
│   │   │   ├── df_stu.pkl
│   │   │   ├── dict_test_folds.pkl
│   │   │   ├── dict_train_folds.pkl
│   │   │   ├── dict_valid_folds.pkl
│   │   │   └── final_kwargs.pkl
│   │   └── FrcSub_one_fold
│   │       ├── datatpl_cfg.json
│   │       ├── df_exer.pkl
│   │       ├── df_stu.pkl
│   │       ├── dict_test_folds.pkl
│   │       ├── dict_train_folds.pkl
│   │       ├── dict_valid_folds.pkl
│   │       └── final_kwargs.pkl
│   ├── middata
│   │   ├── FrcSub.exer.csv
│   │   └── FrcSub.inter.csv
│   └── rawdata
│       ├── data.txt
│       ├── problemdesc.txt
│       ├── qnames.txt
│       └── q.txt
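As a rough illustration of the layout above, the following sketch checks which of the three stage folders exist and are non-empty for a given dataset. The helper names `available_stages` and `latest_stage` are hypothetical and not part of EduStudio; only the folder names `rawdata`, `middata`, and `cachedata` come from the layout shown above.

```python
from pathlib import Path

# Stage folder names, ordered from least to most processed.
STAGES = ("rawdata", "middata", "cachedata")


def available_stages(data_root, dataset):
    """Return the stage folders that exist and contain at least one entry."""
    dataset_dir = Path(data_root) / dataset
    found = []
    for stage in STAGES:
        stage_dir = dataset_dir / stage
        if stage_dir.is_dir() and any(stage_dir.iterdir()):
            found.append(stage)
    return found


def latest_stage(data_root, dataset):
    """Return the most processed stage available, or None if none is present."""
    stages = available_stages(data_root, dataset)
    return stages[-1] if stages else None
```

A check like `latest_stage("data", "FrcSub")` could help decide which stage to load from before a run, assuming the data root is a folder named `data` as in the tree above.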
Dataset Stage Protocol
The rawdata folder stores raw data files. Different datasets have different raw data file formats.
The following example loads the dataset from rawdata.
from edustudio.quickstart import run_edustudio

run_edustudio(
    dataset='FrcSub',
    cfg_file_name=None,
    traintpl_cfg_dict={
        'cls': 'GeneralTrainTPL',
    },
    datatpl_cfg_dict={
        'cls': 'CDInterExtendsQDataTPL',
        'load_data_from': 'rawdata',  # specify the loading stage of the dataset
        'raw2mid_op': 'R2M_FrcSub',  # specify the R2M atomic operation
    },
    modeltpl_cfg_dict={
        'cls': 'KaNCD',
    },
    evaltpl_cfg_dict={
        'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
    }
)
The middata folder stores middle data files. For existing datasets, we provide atomic operations that inherit from the protocol class BaseRaw2Mid and process raw data into middle data. The middle data is required to follow the atomic file protocol.
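To make the raw-to-middle step more concrete, here is a minimal sketch of what such an operation might look like. It does not use EduStudio's BaseRaw2Mid: the base class `Raw2MidSketch`, the subclass `ResponseMatrixToInter`, the assumed raw layout (one student per line, one whitespace-separated 0/1 response per exercise), and the output column names are all assumptions made for illustration.

```python
import csv
from abc import ABC, abstractmethod
from pathlib import Path


class Raw2MidSketch(ABC):
    """Hypothetical stand-in for a raw-to-middle operation (not EduStudio's BaseRaw2Mid)."""

    def __init__(self, dataset_dir, dataset_name):
        self.rawpath = Path(dataset_dir) / "rawdata"
        self.midpath = Path(dataset_dir) / "middata"
        self.dataset_name = dataset_name

    @abstractmethod
    def process(self):
        """Read files under rawpath and write standardized files under midpath."""


class ResponseMatrixToInter(Raw2MidSketch):
    """Turn a student-by-exercise response matrix into one row per interaction."""

    def process(self):
        self.midpath.mkdir(parents=True, exist_ok=True)
        out_file = self.midpath / f"{self.dataset_name}.inter.csv"
        with open(self.rawpath / "data.txt") as src, \
                open(out_file, "w", newline="") as dst:
            writer = csv.writer(dst)
            # Assumed column names; the real ones are set by the middle data format.
            writer.writerow(["stu_id", "exer_id", "label"])
            for stu_id, line in enumerate(src):
                for exer_id, label in enumerate(line.split()):
                    writer.writerow([stu_id, exer_id, label])
        return out_file
```

The point of the design is that dataset-specific parsing lives in one `process` method, so everything downstream only ever sees the standardized interaction file.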
The following example loads the dataset from middata.
from edustudio.quickstart import run_edustudio

run_edustudio(
    dataset='FrcSub',
    cfg_file_name=None,
    traintpl_cfg_dict={
        'cls': 'GeneralTrainTPL',
    },
    datatpl_cfg_dict={
        'cls': 'CDInterExtendsQDataTPL',
        'load_data_from': 'middata',  # specify the loading stage of the dataset
        'is_save_cache': True,  # whether to save cache data
        'cache_id': 'cache_default',  # cache id, valid when is_save_cache=True
    },
    modeltpl_cfg_dict={
        'cls': 'KaNCD',
    },
    evaltpl_cfg_dict={
        'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
    }
)
With middata that follows the atomic file protocol, we can implement other atomic operations that inherit from the protocol class BaseMid2Cache to build cachedata from middata.
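As an illustration of a middle-to-cache operation, the sketch below splits interaction records into train, validation, and test parts. It is not EduStudio's BaseMid2Cache: the class names `Mid2CacheSketch` and `RandomSplitSketch`, the keyword-argument interface of `process`, and the keys `inter`, `train`, `valid`, and `test` are assumptions chosen for this example.

```python
import random


class Mid2CacheSketch:
    """Hypothetical stand-in for a middle-to-cache operation (not EduStudio's BaseMid2Cache)."""

    def process(self, **kwargs):
        # Each operation receives the current data objects and returns the updated ones.
        return kwargs


class RandomSplitSketch(Mid2CacheSketch):
    """Shuffle the interaction records and split them by the given ratios."""

    def __init__(self, ratios=(0.8, 0.1, 0.1), seed=0):
        self.ratios = ratios
        self.seed = seed

    def process(self, **kwargs):
        records = list(kwargs["inter"])  # copy so the caller's list is left untouched
        random.Random(self.seed).shuffle(records)  # seeded for reproducible splits
        n_train = int(len(records) * self.ratios[0])
        n_valid = int(len(records) * self.ratios[1])
        kwargs["train"] = records[:n_train]
        kwargs["valid"] = records[n_train:n_train + n_valid]
        kwargs["test"] = records[n_train + n_valid:]  # remainder goes to the test part
        return kwargs
```

Returning the full set of keyword arguments lets one operation add new objects (here the three splits) while passing everything else through unchanged to the next operation.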
The following example loads the dataset from cachedata.
from edustudio.quickstart import run_edustudio

run_edustudio(
    dataset='FrcSub',
    cfg_file_name=None,
    traintpl_cfg_dict={
        'cls': 'GeneralTrainTPL',
    },
    datatpl_cfg_dict={
        'cls': 'CDInterExtendsQDataTPL',
        'load_data_from': 'cachedata',  # specify the loading stage of the dataset
        'cache_id': 'cache_default',  # cache id, valid when is_save_cache=True
    },
    modeltpl_cfg_dict={
        'cls': 'KaNCD',
    },
    evaltpl_cfg_dict={
        'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
    }
)
Example: Atomic Operation Sequence of Data Processing
R2M_FrcSub: Process the FrcSub dataset from rawdata to middata
M2C_FilterRecords4CD: Filter out students or exercises whose number of interaction records is less than a threshold
M2C_ReMapId: Remap feature IDs
M2C_RandomDataSplit4CD: Split Datasets
M2C_GenQMat: Generate Q-matrix
The ‘mid2cache_op_seq’ option in datatpl_cfg specifies the atomic operation sequence.
from edustudio.quickstart import run_edustudio

run_edustudio(
    dataset='FrcSub',
    cfg_file_name=None,
    traintpl_cfg_dict={
        'cls': 'GeneralTrainTPL',
    },
    datatpl_cfg_dict={
        'cls': 'CDInterExtendsQDataTPL',
        'load_data_from': 'rawdata',  # specify the loading stage of the dataset
        'raw2mid_op': 'R2M_FrcSub',
        # the 'mid2cache_op_seq' option specifies the atomic operation sequence
        'mid2cache_op_seq': [
            'M2C_Label2Int',
            'M2C_FilterRecords4CD',
            'M2C_ReMapId',
            'M2C_RandomDataSplit4CD',
            'M2C_GenQMat',
        ],
    },
    modeltpl_cfg_dict={
        'cls': 'KaNCD',
    },
    evaltpl_cfg_dict={
        'clses': ['PredictionEvalTPL', 'InterpretabilityEvalTPL'],
    }
)
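To see why the order of an operation sequence matters, consider the following self-contained sketch. It does not reproduce EduStudio's internals: the registry `OP_REGISTRY`, the runner `run_op_seq`, and the two toy operations are hypothetical names that only mimic the idea of resolving operation names and applying them in list order.

```python
from collections import Counter


def filter_sparse_students(records, min_records=2):
    """Drop students with fewer than `min_records` interaction records."""
    counts = Counter(rec["stu_id"] for rec in records)
    return [rec for rec in records if counts[rec["stu_id"]] >= min_records]


def remap_student_ids(records):
    """Renumber the student ids as 0, 1, 2, ... following their sorted order."""
    unique_ids = sorted({rec["stu_id"] for rec in records})
    mapping = {raw: new for new, raw in enumerate(unique_ids)}
    return [{**rec, "stu_id": mapping[rec["stu_id"]]} for rec in records]


# Hypothetical registry: operation name -> callable (not EduStudio's names).
OP_REGISTRY = {
    "filter_sparse_students": filter_sparse_students,
    "remap_student_ids": remap_student_ids,
}


def run_op_seq(records, op_seq):
    """Apply the named operations to `records` strictly in list order."""
    for name in op_seq:
        records = OP_REGISTRY[name](records)
    return records
```

In this toy setup, filtering before remapping leaves contiguous student ids, whereas remapping first and filtering afterwards can leave gaps in the id range. This is one plausible reason for placing the filtering operation before the remapping operation in a sequence like the one above.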