Run a model with config files

Note

The tutorial is for FuxiCTR v1.0.

This tutorial shows how to use YAML config files to define dataset and model hyper-parameters, and then run the model.

We take DeepFM_with_config.py in the demo directory as an example. The config files are located in the demo/demo_config folder.

The dataset config dataset_config.yaml is as follows:

### Tiny data for tests only
taobao_tiny_data:
    data_root: ../data/
    data_format: csv
    train_data: ../data/tiny_data/train_sample.csv
    valid_data: ../data/tiny_data/valid_sample.csv
    test_data: ../data/tiny_data/test_sample.csv
    min_categr_count: 1
    feature_cols:
        - {name: ["userid","adgroup_id","pid","cate_id","campaign_id","customer","brand","cms_segid",
                  "cms_group_id","final_gender_code","age_level","pvalue_level","shopping_level","occupation"], 
                  active: True, dtype: str, type: categorical}
    label_col: {name: clk, dtype: float}
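
Each entry in feature_cols declares a group of columns that share the same dtype and feature type. The demo data contains only categorical features, but FuxiCTR also supports other feature types such as numeric and sequence. As a minimal sketch (the price and click_history columns are hypothetical and not part of the demo data), such columns could be declared like this:

feature_cols:
    - {name: ["userid", "adgroup_id"], active: True, dtype: str, type: categorical}
    # hypothetical numeric feature
    - {name: "price", active: True, dtype: float, type: numeric}
    # hypothetical behavior-sequence feature; splitter gives the item separator
    - {name: "click_history", active: True, dtype: str, type: sequence, splitter: "^"}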

The model config model_config.yaml is as follows:

# The `Base` can be shared by different expid settings
Base: 
    model_root: '../checkpoints/'
    workers: 3
    verbose: 1
    patience: 2
    pickle_feature_encoder: True
    use_hdf5: True
    save_best_only: True
    every_x_epochs: 1
    debug: False

# The expid should be unique among all settings
DeepFM_test:
    model: DeepFM
    dataset_id: taobao_tiny_data # each expid corresponds to a dataset_id
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    hidden_units: [64, 32]
    hidden_activations: relu
    net_regularizer: 0
    embedding_regularizer: 1.e-8
    learning_rate: 1.e-3
    batch_norm: False
    net_dropout: 0
    batch_size: 128
    embedding_dim: 4
    epochs: 1
    shuffle: True
    seed: 2019
    monitor: 'AUC'
    monitor_mode: 'max'

We use Base to keep common hyper-parameters that can be shared across different expid settings. You may also merge all the key-value pairs from Base directly into DeepFM_test if that is more convenient, as sketched below.
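
For illustration, merging the two blocks above would be equivalent to a single entry like this (a sketch, not an additional config to add):

DeepFM_test:
    # keys merged from Base
    model_root: '../checkpoints/'
    workers: 3
    verbose: 1
    patience: 2
    pickle_feature_encoder: True
    use_hdf5: True
    save_best_only: True
    every_x_epochs: 1
    debug: False
    # expid-specific keys, unchanged
    model: DeepFM
    dataset_id: taobao_tiny_data
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    # ... remaining DeepFM_test keys as above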

Note that the names dataset_config and model_config must be kept unchanged. Both the dataset config and the model config should live in the same directory, in one of two layouts: either 1) put dataset_config.yaml and model_config.yaml together, as shown in ./demo/demo_config, or 2) put them in dataset_config and model_config folders, as shown in ./config, which is handy when many config files are available. Both layouts are illustrated below.
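
The two layouts look like this (the dataset_config file name under ./config is illustrative; actual file names may differ):

# Layout 1: two single files in one directory
demo/demo_config/
├── dataset_config.yaml
└── model_config.yaml

# Layout 2: one folder per config type, each holding multiple YAML files
config/
├── dataset_config/
│   └── taobao_tiny.yaml      # illustrative file name
└── model_config/
    └── tests.yaml            # contains the DeepFM_test expid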

import logging
from fuxictr.datasets import data_generator
from fuxictr.datasets.taobao import FeatureEncoder
from fuxictr.utils import set_logger, print_to_json, load_config
from fuxictr.pytorch.models import DeepFM
from fuxictr.pytorch.utils import seed_everything

if __name__ == '__main__':
    # Load params from config files
    config_dir = 'demo_config'
    experiment_id = 'DeepFM_test'
    params = load_config(config_dir, experiment_id)

    # set up logger and random seed
    set_logger(params)
    logging.info(print_to_json(params))
    seed_everything(seed=params['seed'])

    # Set up feature encoder
    feature_encoder = FeatureEncoder(**params)
    feature_encoder.fit(train_data=params['train_data'], 
                        min_categr_count=params['min_categr_count'])

    # Build train/validation/test data generators
    train_gen, valid_gen, test_gen = data_generator(feature_encoder,
                                                    train_data=params['train_data'],
                                                    valid_data=params['valid_data'],
                                                    test_data=params['test_data'],
                                                    batch_size=params['batch_size'],
                                                    shuffle=params['shuffle'],
                                                    use_hdf5=params['use_hdf5'])
    # Build a DeepFM model
    model = DeepFM(feature_encoder.feature_map, **params)
    model.fit_generator(train_gen, validation_data=valid_gen, epochs=params['epochs'],
                        verbose=params['verbose'])
   
    # Reload the weights of the best checkpoint
    model.load_weights(model.checkpoint)

    # Evaluation on the validation set
    logging.info('***** validation results *****')
    model.evaluate_generator(valid_gen)

    # Evaluation on the test set
    logging.info('***** test results *****')
    model.evaluate_generator(test_gen)
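
Assuming FuxiCTR is installed (or added to your PYTHONPATH) and the working directory is demo/ so that the relative paths in the configs resolve, the script can be run directly:

cd demo
python DeepFM_with_config.py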

For convenience, we also provide the tool script run_expid.py to run FuxiCTR models from YAML config files. It takes the following arguments:

  • --config: The directory containing the dataset and model config files.

  • --expid: The experiment id specifying the detailed experimental settings.

  • --gpu: The GPU index to use for the experiment (-1 for CPU).

Try the following examples:

cd benchmarks
# run DeepFM_test defined in demo/demo_config
python run_expid.py --config ../demo/demo_config --expid DeepFM_test --gpu 0
# run DeepFM_test defined in config/model_config/tests.yaml
python run_expid.py --config ../config --expid DeepFM_test --gpu 0