Run a model with h5 data as input

Note

The tutorial is for FuxiCTR v1.2.

This tutorial presents how to run a model with h5 data as input.

FuxiCTR supports both csv and h5 data as input. After a model runs with a csv dataset config, the corresponding h5 data are generated under the data_root path. The produced h5 files can then be reused flexibly in other experiments. We demonstrate this with the dataset taobao_tiny_h5 as follows.

# demo/demo_config/dataset_config.yaml
taobao_tiny_h5:
    data_root: ../data/
    data_format: h5
    train_data: ../data/taobao_tiny_h5/train.h5
    valid_data: ../data/taobao_tiny_h5/valid.h5
    test_data: ../data/taobao_tiny_h5/test.h5
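
The h5 config above is typically produced by first running the csv counterpart. As a point of reference, a csv-format dataset config might look like the sketch below; the paths, the min_categr_count value, and the exact column grouping are illustrative assumptions modeled on FuxiCTR's demo configs, not taken from this tutorial.

```yaml
# hypothetical csv counterpart of taobao_tiny_h5 (illustrative)
taobao_tiny:
    data_root: ../data/
    data_format: csv
    train_data: ../data/tiny_data/train_sample.csv
    valid_data: ../data/tiny_data/valid_sample.csv
    test_data: ../data/tiny_data/test_sample.csv
    min_categr_count: 1
    feature_cols:
        - {name: [userid, adgroup_id, pid, cate_id, campaign_id, customer, brand,
                  cms_segid, cms_group_id, final_gender_code, age_level,
                  pvalue_level, shopping_level, occupation],
           active: True, dtype: str, type: categorical}
    label_col: {name: clk, dtype: float}
```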

Note that each h5 dataset comes with a feature_map.json file, which stores the feature specifications required for data loading and model training. Take the following feature_map.json as an example.

# data/taobao_tiny_h5/feature_map.json
{
    "dataset_id": "taobao_tiny_h5",
    "num_fields": 14,
    "num_features": 476,
    "input_length": 14,
    "feature_specs": {
        "userid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 25,
            "index": 0
        },
        "adgroup_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 100,
            "index": 1
        },
        "pid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 2
        },
        "cate_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 48,
            "index": 3
        },
        "campaign_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 98,
            "index": 4
        },
        "customer": {
            "source": "",
            "type": "categorical",
            "vocab_size": 97,
            "index": 5
        },
        "brand": {
            "source": "",
            "type": "categorical",
            "vocab_size": 66,
            "index": 6
        },
        "cms_segid": {
            "source": "",
            "type": "categorical",
            "vocab_size": 10,
            "index": 7
        },
        "cms_group_id": {
            "source": "",
            "type": "categorical",
            "vocab_size": 10,
            "index": 8
        },
        "final_gender_code": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 9
        },
        "age_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 6,
            "index": 10
        },
        "pvalue_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 11
        },
        "shopping_level": {
            "source": "",
            "type": "categorical",
            "vocab_size": 4,
            "index": 12
        },
        "occupation": {
            "source": "",
            "type": "categorical",
            "vocab_size": 3,
            "index": 13
        }
    }
}
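
Reading the example above, a few invariants hold: num_fields equals the number of entries in feature_specs, num_features is the sum of the per-field vocab_size values (25 + 100 + ... + 3 = 476 above), and the index values run contiguously from 0. Below is a stdlib-only sketch that checks these invariants on a trimmed, three-field copy of the file; the trimming and the consistency check are our own illustration, not part of FuxiCTR.

```python
import json

# A trimmed copy of feature_map.json (three fields only); in practice you
# would json.load() the real file from data/taobao_tiny_h5/.
feature_map = json.loads("""
{
    "dataset_id": "taobao_tiny_h5",
    "num_fields": 3,
    "num_features": 128,
    "input_length": 3,
    "feature_specs": {
        "userid":     {"source": "", "type": "categorical", "vocab_size": 25,  "index": 0},
        "adgroup_id": {"source": "", "type": "categorical", "vocab_size": 100, "index": 1},
        "pid":        {"source": "", "type": "categorical", "vocab_size": 3,   "index": 2}
    }
}
""")

specs = feature_map["feature_specs"]
# num_fields equals the number of feature specs
assert feature_map["num_fields"] == len(specs)
# num_features is the summed vocab_size over all fields
assert feature_map["num_features"] == sum(s["vocab_size"] for s in specs.values())
# field indices are contiguous, starting from 0
assert sorted(s["index"] for s in specs.values()) == list(range(len(specs)))
print("feature_map is internally consistent")
```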

Given the h5 dataset as input, we provide a demo to run DeepFM in DeepFM_with_h5_config.py. The core code is as follows:

# load the params
config_dir = 'demo_config'
experiment_id = 'DeepFM_test_h5' # corresponds to the h5 input `taobao_tiny_h5`
params = load_config(config_dir, experiment_id)

# Load feature_map from json
data_dir = os.path.join(params['data_root'], params['dataset_id'])
feature_map = FeatureMap(params['dataset_id'], data_dir)
feature_map.load(os.path.join(data_dir, "feature_map.json"))

# Get train and validation data generator from h5
train_gen, valid_gen = datasets.h5_generator(feature_map, 
                                             stage='train', 
                                             train_data=os.path.join(data_dir, 'train.h5'),
                                             valid_data=os.path.join(data_dir, 'valid.h5'),
                                             batch_size=params['batch_size'],
                                             shuffle=params['shuffle'])

# Model initialization and fitting                                                  
model = DeepFM(feature_map, **params)
model.count_parameters() # print number of parameters used in model
model.fit_generator(train_gen, 
                    validation_data=valid_gen, 
                    epochs=params['epochs'],
                    verbose=params['verbose'])
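
Conceptually, h5_generator wraps the arrays stored in the h5 files and yields shuffled mini-batches of batch_size rows per training step. The following stdlib-only sketch illustrates that batching contract on a plain Python list; it is our simplified stand-in, not FuxiCTR's actual implementation, which operates on numpy arrays loaded from h5.

```python
import random

def batch_generator(samples, batch_size, shuffle=True, seed=0):
    """Yield mini-batches from an in-memory list of samples.

    Illustrative sketch of the batching/shuffling behavior that a data
    generator provides; the last batch holds the remainder if the sample
    count is not divisible by batch_size.
    """
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

batches = list(batch_generator(list(range(10)), batch_size=4, shuffle=False))
# 10 samples with batch_size=4 -> batches of sizes 4, 4, 2
assert [len(b) for b in batches] == [4, 4, 2]
```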

The full code is available in demo/DeepFM_with_h5_config.py. You can run the demo as shown below. In addition, if you would like to change the settings of a feature field, you can modify the corresponding values in feature_map.json.

cd demo
python DeepFM_with_h5_config.py