Run a model with h5 data as input
Note: This tutorial is for FuxiCTR v1.2.
This tutorial presents how to run a model with h5 data as input.
FuxiCTR supports both csv and h5 data as input. After running a model with a csv dataset config, the corresponding h5 data is generated under the data_root path. One can then reuse the produced h5 data flexibly for other experiments. We demonstrate this with the dataset taobao_tiny_h5 as follows.
# demo/demo_config/dataset_config.yaml
taobao_tiny_h5:
    data_root: ../data/
    data_format: h5
    train_data: ../data/taobao_tiny_h5/train.h5
    valid_data: ../data/taobao_tiny_h5/valid.h5
    test_data: ../data/taobao_tiny_h5/test.h5
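With the demo data, the dataset directory referenced by this config is expected to contain the three h5 splits together with the feature map described next:
data/taobao_tiny_h5/
    train.h5
    valid.h5
    test.h5
    feature_map.json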
Note that each h5 dataset contains a feature_map.json
file, which saves the feature specifications required for data loading and model training. Take the following feature_map.json as an example.
# data/taobao_tiny_h5/feature_map.json
{
    "dataset_id": "taobao_tiny_h5",
    "num_fields": 14,
    "num_features": 476,
    "input_length": 14,
    "feature_specs": {
        "userid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 25,
            "index": 0
        },
        "adgroup_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 100,
            "index": 1
        },
        "pid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 2
        },
        "cate_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 48,
            "index": 3
        },
        "campaign_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 98,
            "index": 4
        },
        "customer": {
            "source": "",
            "type": "categorical",
            "vocab_size": 97,
            "index": 5
        },
        "brand": {
            "source": "",
            "type": "categorical",
            "vocab_size": 66,
            "index": 6
        },
        "cms_segid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 10,
            "index": 7
        },
        "cms_group_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 10,
            "index": 8
        },
        "final_gender_code": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 9
        },
        "age_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 6,
            "index": 10
        },
        "pvalue_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 11
        },
        "shopping_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 4,
            "index": 12
        },
        "occupation": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 13
        }
    }
}
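Because feature_map.json is plain JSON, it is easy to inspect programmatically. For instance, in this example the num_features count equals the sum of the per-field vocab_size values; a minimal sketch to verify this (adjust the file path to where your data lives):
import json

# Load the feature map produced for taobao_tiny_h5
with open("data/taobao_tiny_h5/feature_map.json") as f:
    feature_map = json.load(f)

specs = feature_map["feature_specs"]
print(feature_map["num_fields"], "==", len(specs))               # 14 == 14
# Total vocabulary size accumulated over all categorical fields
total_vocab = sum(spec["vocab_size"] for spec in specs.values())
print(feature_map["num_features"], "==", total_vocab)            # 476 == 476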
Given the h5 dataset as input, we provide a demo of running DeepFM in DeepFM_with_h5_config.py. The core code is as follows:
# Load the params
config_dir = 'demo_config'
experiment_id = 'DeepFM_test_h5' # corresponds to h5 input `taobao_tiny_h5`
params = load_config(config_dir, experiment_id)

# Load feature_map from json
data_dir = os.path.join(params['data_root'], params['dataset_id'])
feature_map = FeatureMap(params['dataset_id'], data_dir)
feature_map.load(os.path.join(data_dir, "feature_map.json"))

# Get train and validation data generators from h5
train_gen, valid_gen = datasets.h5_generator(feature_map,
                                             stage='train',
                                             train_data=os.path.join(data_dir, 'train.h5'),
                                             valid_data=os.path.join(data_dir, 'valid.h5'),
                                             batch_size=params['batch_size'],
                                             shuffle=params['shuffle'])

# Model initialization and fitting
model = DeepFM(feature_map, **params)
model.count_parameters() # print number of parameters used in the model
model.fit_generator(train_gen,
                    validation_data=valid_gen,
                    epochs=params['epochs'],
                    verbose=params['verbose'])
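The full script also evaluates the fitted model on the test split. A minimal sketch of that step, assuming the same h5_generator interface with stage='test' and the evaluate_generator method of FuxiCTR v1.2 models (see demo/DeepFM_with_h5_config.py for the exact calls):
# Get the test data generator from h5 (interface assumed to mirror the training call above)
test_gen = datasets.h5_generator(feature_map,
                                 stage='test',
                                 test_data=os.path.join(data_dir, 'test.h5'),
                                 batch_size=params['batch_size'],
                                 shuffle=False)
# Evaluate the fitted model on the test split
model.evaluate_generator(test_gen)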
The full code is available in demo/DeepFM_with_h5_config.py. You can run the demo as shown below. In addition, if you would like to change the settings of a feature field, you can modify the corresponding values in feature_map.json.
cd demo
python DeepFM_with_h5_config.py
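As mentioned above, field settings can be changed by editing feature_map.json directly before rerunning. A minimal sketch of the mechanics, using the userid field purely as an illustration (which keys are safe to change depends on your data, and values such as vocab_size must stay consistent with the h5 files):
import json

# Adjust the path to where your h5 dataset lives
path = "data/taobao_tiny_h5/feature_map.json"
with open(path) as f:
    feature_map = json.load(f)

# Inspect the spec of one field before editing it
print(feature_map["feature_specs"]["userid"])
# ... modify the values you need here, keeping them consistent with the h5 data ...

with open(path, "w") as f:
    json.dump(feature_map, f, indent=4)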