Customize Atomic Operation
In EduStudio, we treat the whole data processing as multiple atomic operations called atomic operation sequence.
The first atomic operation, inheriting the protocol class BaseRaw2Mid
, is the process from raw data to middle data.
The following atomic operations, inheriting the protocol class BaseMid2Cache
, construct the process from middle data to cache data.
BaseRaw2Mid
The atomic operations inheriting BaseRaw2Mid
preprocess the raw dataset into middle dataset (standardized data files).
Protocols
The protocols in BaseRaw2Mid
are listed as follows:
name |
description |
type |
note |
---|---|---|---|
self.dt |
current dataset name |
instance variable |
given in BaseRaw2Mid |
self.rawpath |
raw data path of current dataset |
instance variable |
given in BaseRaw2Mid |
self.midpath |
middle data path of current dataset |
instance variable |
given in BaseRaw2Mid |
self.logger |
logger object |
instance variable |
given in BaseRaw2Mid |
process |
preprocess the raw dataset into middle dataset |
function interface |
implemented by subclass |
Example
The following example illustrates the process of Assistment 2019-2010
dataset from raw data to middle data.
class R2M_ASSIST_0910(BaseRaw2Mid):
def process(self):
df = pd.read_csv(f"{self.rawpath}/skill_builder_data.csv", encoding='ISO-8859-1')
......
df_inter.to_csv(f"{self.midpath}/{self.dt}.inter.csv", index=False, encoding='utf-8')
df_user.to_csv(f"{self.midpath}/{self.dt}.stu.csv", index=False, encoding='utf-8')
df_exer.to_csv(f"{self.midpath}/{self.dt}.exer.csv", index=False, encoding='utf-8')
The above function first read raw data file from specific folder path (i.e., self.rawpath). After processing middle data, it will save middle data in specific folder path (i.e., self.midpath).
BaseMid2Cache
The atomic operations inheriting BaseMid2Cache
preprocess the middle dataset into cache dataset (standardized data files). Different from atomic operations inheriting BaseRaw2Mid
, in one atomic operation sequence, atomic operations inheriting BaseRaw2Mid
should be unique and be the first position. Atomic operations inheriting BaseMid2Cache
could be multiple and dominate following all operations.
Protocols
The protocols in BaseMid2Cache
are listed as follows:
name |
description |
type |
note |
---|---|---|---|
default_cfg |
the default configuration of operation |
class variable |
|
self.logger |
logger object |
instance variable |
given in BaseMid2Cache |
self.m2c_cfg |
actual configuration in running process |
instance variable |
given in BaseMid2Cache |
_check_params |
check rationality of configuration |
function interface |
implemented by subclass |
process |
preprocess the raw dataset into middle dataset |
function interface |
implemented by subclass |
set_dt_info |
store dataset information in the process (such as student number) |
function interface |
implemented by subclass |
Example
The following example illustrates the partial process code of M2C_RandomDataSplit4CD
atomic operation, which splits datasets for cognitive diagnosis.
class M2C_RandomDataSplit4CD(BaseMid2Cache):
default_cfg = {
'seed': 2023,
"divide_scale_list": [7,1,2],
}
def _check_params(self):
super()._check_params()
assert 2 <= len(self.m2c_cfg['divide_scale_list']) <= 3
assert sum(self.m2c_cfg['divide_scale_list']) == 10
def process(self, **kwargs):
df = kwargs['df']
if self.n_folds == 1:
assert kwargs.get("df_train", None) is None
assert kwargs.get("df_valid", None) is None
assert kwargs.get("df_test", None) is None
df_train, df_valid, df_test = self.one_fold_split(df)
kwargs['df_train_folds'] = [df_train]
kwargs['df_valid_folds'] = [df_valid] if df_valid is not None else []
kwargs['df_test_folds'] = [df_test]
else:
df_train_list, df_test_list = self.multi_fold_split(df)
kwargs['df_train_folds'] = df_train_list
kwargs['df_test_folds'] = df_test_list
return kwargs
def set_dt_info(self, dt_info, **kwargs):
if 'stu_id:token' in kwargs['df'].columns:
dt_info['stu_count'] = int(kwargs['df']['stu_id:token'].max() + 1)