Run a model with config files

Note

The tutorial is for FuxiCTR v1.2.

This tutorial shows how to use YAML config files to define dataset and model hyper-parameters, and then run the model.

We take DeepFM_with_csv_config.py in the demo directory as an example. The config files are located in the demo/demo_config folder.

The dataset config dataset_config.yaml is as follows:

### Tiny data for tests only
taobao_tiny:
    data_root: ../data/
    data_format: csv
    train_data: ../data/tiny_data/train_sample.csv
    valid_data: ../data/tiny_data/valid_sample.csv
    test_data: ../data/tiny_data/test_sample.csv
    min_categr_count: 1
    feature_cols:
        - {name: ["userid","adgroup_id","pid","cate_id","campaign_id","customer","brand","cms_segid",
                  "cms_group_id","final_gender_code","age_level","pvalue_level","shopping_level","occupation"], 
                  active: True, dtype: str, type: categorical}
    label_col: {name: clk, dtype: float}
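
To quickly check that the dataset config parses as expected, you can load it with PyYAML (a minimal sketch, not part of FuxiCTR; the path assumes you run it from the repository root):

import yaml

# load dataset_config.yaml and inspect the taobao_tiny entry
with open("demo/demo_config/dataset_config.yaml", "r") as f:
    dataset_config = yaml.safe_load(f)

print(dataset_config["taobao_tiny"]["feature_cols"])
print(dataset_config["taobao_tiny"]["label_col"])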

The model config model_config.yaml is as follows:

# The `Base` can be shared by different expid settings
Base: 
    model_root: '../checkpoints/'
    workers: 3
    verbose: 1
    patience: 2
    pickle_feature_encoder: True
    use_hdf5: True
    save_best_only: True
    every_x_epochs: 1
    debug: False

# The expid should be unique among all settings
DeepFM_test:
    model: DeepFM
    dataset_id: taobao_tiny # each expid corresponds to a dataset_id in dataset_config.yaml
    loss: 'binary_crossentropy'
    metrics: ['logloss', 'AUC']
    task: binary_classification
    optimizer: adam
    hidden_units: [64, 32]
    hidden_activations: relu
    net_regularizer: 0
    embedding_regularizer: 1.e-8
    learning_rate: 1.e-3
    batch_norm: False
    net_dropout: 0
    batch_size: 128
    embedding_dim: 4
    epochs: 1
    shuffle: True
    seed: 2019
    monitor: 'AUC'
    monitor_mode: 'max'

We use Base to keep common hyper-parameters that can be shared by different expid settings. You can also merge all the key-value pairs in Base directly into DeepFM_test if that is more convenient.
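
For example, an equivalent config without Base would inline the shared fields into the expid section (a sketch assembled from the values above; the remaining fields stay as listed in DeepFM_test):

DeepFM_test:
    model: DeepFM
    dataset_id: taobao_tiny
    model_root: '../checkpoints/'   # moved here from Base
    workers: 3                      # moved here from Base
    verbose: 1                      # moved here from Base
    patience: 2                     # moved here from Base
    # ... plus the other Base keys and the DeepFM_test hyper-parameters shown above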

Note that the names dataset_config and model_config must remain unchanged. The dataset config and the model config should also be kept in the same directory: either 1) put dataset_config.yaml and model_config.yaml together as shown in ./demo/demo_config, or 2) put them in dataset_config and model_config folders as shown in ./config when a large number of config files is available.
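
The two layouts look like the following (the file names under ./config are illustrative; tests.yaml is the one referenced in the run_expid.py example below):

demo/demo_config/
    dataset_config.yaml
    model_config.yaml

config/
    dataset_config/
        taobao_tiny.yaml    # one or more dataset config files
    model_config/
        tests.yaml          # one or more model config files, e.g. containing DeepFM_test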

import sys
import os
import logging
from datetime import datetime
from fuxictr import datasets
from fuxictr.datasets.taobao import FeatureEncoder
from fuxictr.features import FeatureMap
from fuxictr.utils import load_config, set_logger, print_to_json
from fuxictr.pytorch.models import DeepFM
from fuxictr.pytorch.torch_utils import seed_everything

if __name__ == '__main__':
    # Load params from config files
    config_dir = 'demo_config'
    experiment_id = 'DeepFM_test' # corresponds to csv input `taobao_tiny`
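    # load_config reads both YAML files in config_dir and merges the Base and
    # DeepFM_test sections with the matching dataset section into a single params dict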
    params = load_config(config_dir, experiment_id)

    # set up logger and random seed
    set_logger(params)
    logging.info(print_to_json(params))
    seed_everything(seed=params['seed'])

    # Set feature_encoder that defines how to preprocess data
    feature_encoder = FeatureEncoder(params['feature_cols'], 
                                     params['label_col'], 
                                     dataset_id=params['dataset_id'], 
                                     data_root=params["data_root"])

    # Build dataset from csv to h5
    datasets.build_dataset(feature_encoder, 
                           train_data=params["train_data"], 
                           valid_data=params["valid_data"], 
                           test_data=params["test_data"])
    
    # Get feature_map that defines feature specs
    feature_map = feature_encoder.feature_map

    # Get train and validation data generator from h5
    data_dir = os.path.join(params['data_root'], params['dataset_id'])
    train_gen, valid_gen = datasets.h5_generator(feature_map, 
                                                 stage='train', 
                                                 train_data=os.path.join(data_dir, 'train.h5'),
                                                 valid_data=os.path.join(data_dir, 'valid.h5'),
                                                 batch_size=params['batch_size'],
                                                 shuffle=params['shuffle'])
    
    # Model initialization and fitting                                                  
    model = DeepFM(feature_encoder.feature_map, **params)
    model.count_parameters() # print number of parameters used in model
    model.fit_generator(train_gen, 
                        validation_data=valid_gen, 
                        epochs=params['epochs'],
                        verbose=params['verbose'])
    model.load_weights(model.checkpoint) # reload the best checkpoint
    
    logging.info('***** validation results *****')
    model.evaluate_generator(valid_gen)

    logging.info('***** test results *****')
    test_gen = datasets.h5_generator(feature_map, 
                                     stage='test',
                                     test_data=os.path.join(data_dir, 'test.h5'),
                                     batch_size=params['batch_size'],
                                     shuffle=False)
    model.evaluate_generator(test_gen)
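
To run this demo end to end, execute the script from the demo directory (assuming the tiny sample data is available under data/tiny_data as configured above):

cd demo
python DeepFM_with_csv_config.py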

For convenience, we also provide the tool script run_expid.py to run FuxiCTR models based on YAML config files. It takes the following arguments:

  • --config: The directory containing the dataset and model config files.

  • --expid: The experiment id that specifies the detailed experimental settings.

  • --gpu: The gpu index to use for the experiment, or -1 for CPU.

Try the following examples:

cd benchmarks
# run the demo config
python run_expid.py --config ../demo/demo_config --expid DeepFM_test --gpu 0
# run DeepFM_test, located in config/model_config/tests.yaml
python run_expid.py --config ../config --expid DeepFM_test --gpu 0