Training N2JNet

To train N2JNet, pass in the config yml file as the argument, e.g.

$ python train_n2jnet.py nersc_config.yml

The nersc_config.yml config file in the repo provides an example of how to configure training. We take a look at it here, section by section.
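
Under the hood, the script parses this YAML file into one dictionary per section (data, optimization, model, trainer). A minimal sketch of that step, assuming PyYAML and shown here only for orientation (the actual parsing lives in the training script itself):

import sys

import yaml  # PyYAML

# Sketch only: parse the config into one dict per section.
with open(sys.argv[1], 'r') as f:
    cfg = yaml.safe_load(f)

data_cfg = cfg['data']            # dataloader settings
optim_cfg = cfg['optimization']   # loss weighting, optimizer, LR schedule
model_cfg = cfg['model']          # architecture settings
trainer_cfg = cfg['trainer']      # device, checkpoint directory, seed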

First, we configure the dataloader. This means setting the CosmoDC2 healpixels used for training and validation, the batch size, the number of CPU cores, and the input galaxy features, as well as the galaxy selection and the photometric noise level.

# Dataloader kwargs
data:
  train_dist_name: 'norm'
  train_dist_kwargs:
    loc: 0.01
    scale: 0.04
  in_dir: '/global/cscratch1/sd/jwp/n2j/data_v04'
  train_hp: [9559, 10327, 9687, 9814, 9815, 9816, 9942, 9943, 10070, 10071, 10072, 10198]
  val_hp: [10199, 10200, 10450]
  n_train: [50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000, 50000]
  # Final effective training set size
  n_subsample_train: 200000
  n_val: [50000, 50000, 50000]
  # Final effective val set size
  n_subsample_val: 1000
  batch_size: 1000
  val_batch_size: 1000
  num_workers: 18
  # Global (graph-level) target; final_gamma1, final_gamma2 also available
  sub_target: ['final_kappa']
  # Local (node-level) target
  sub_target_local: ['stellar_mass', 'redshift']
  # Features available; do not modify (determined at data generation time)
  features: ['galaxy_id', 'ra', 'dec', 'redshift',
              'ra_true', 'dec_true', 'redshift_true',
              'bulge_to_total_ratio_i',
              'ellipticity_1_true', 'ellipticity_2_true',
              'ellipticity_1_bulge_true', 'ellipticity_1_disk_true',
              'ellipticity_2_bulge_true', 'ellipticity_2_disk_true',
              'shear1', 'shear2', 'convergence',
              'size_bulge_true', 'size_disk_true', 'size_true',
              'mag_u_lsst', 'mag_g_lsst', 'mag_r_lsst',
              'mag_i_lsst', 'mag_z_lsst', 'mag_Y_lsst']
  # Features to use as input
  sub_features: ['ra_true', 'dec_true',
                  'mag_u_lsst', 'mag_g_lsst', 'mag_r_lsst',
                  'mag_i_lsst', 'mag_z_lsst', 'mag_Y_lsst']
  noise_kwargs:
    mag:
      override_kwargs: null
      depth: 5
  detection_kwargs:
    ref_features: ['mag_i_lsst']
    max_vals: [25.3]
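
As a quick sanity check on the numbers above: n_train lists one raw sample size per entry of train_hp (and likewise n_val per val_hp), while n_subsample_train and n_subsample_val give the final effective set sizes. The snippet below, reusing the cfg dictionary from the parsing sketch above, is only a hypothetical consistency check, not part of the package:

# Hypothetical consistency check on the 'data' section; not part of n2j.
data_cfg = cfg['data']

# One raw sample size per healpix, so the lengths must match.
assert len(data_cfg['n_train']) == len(data_cfg['train_hp'])
assert len(data_cfg['n_val']) == len(data_cfg['val_hp'])

raw_train = sum(data_cfg['n_train'])             # 12 x 50000 = 600000 available
effective_train = data_cfg['n_subsample_train']  # 200000 actually used
print(f'Subsampling {effective_train} of {raw_train} available training examples')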

Then we configure the optimizer: the relative weighting between the loss on the global convergence label and the loss on the local stellar mass and redshift labels, early stopping, the initial learning rate, and the learning-rate decay.

# Optimizer kwargs
optimization:
  early_stop_memory: 50
  weight_local_loss: 1.0
  optim_kwargs:
    lr: 0.001
    weight_decay: 0.0001
  lr_scheduler_kwargs:
    patience: 5
    factor: 0.5
    min_lr: 0.0000001
    verbose: True
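
To make these keys concrete, the sketch below feeds them into a standard PyTorch optimizer and scheduler. Adam, ReduceLROnPlateau, and the linear stand-in model are assumptions made purely for illustration; the actual classes are chosen inside the n2j training code.

import torch

# Assumption: Adam + ReduceLROnPlateau, with a linear stand-in model so the
# sketch runs on its own; the real model is N2JNet.
optim_cfg = cfg['optimization']
model = torch.nn.Linear(8, 1)

optimizer = torch.optim.Adam(model.parameters(), **optim_cfg['optim_kwargs'])
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, **optim_cfg['lr_scheduler_kwargs'])

# weight_local_loss sets the relative weight of the node-level (local) loss
# against the graph-level (global) convergence loss, schematically:
#   total_loss = global_loss + weight_local_loss * local_loss
# early_stop_memory presumably sets how many epochs without validation
# improvement are tolerated before training stops.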

We then configure the depth and width of the network architecture. The global_flow key determines whether the final layer predicting the convergence is a normalizing flow.

# Model kwargs
model:
  dim_local: 50
  dim_global: 50
  dim_hidden: 50
  dim_pre_aggr: 50
  n_iter: 5
  n_out_layers: 5
  dropout: 0.04
  global_flow: False
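
Tying this back to the data section, and again reusing cfg from above: the number of input node features is simply the length of sub_features, while the dim_* keys set the layer widths and n_iter / n_out_layers set the depth. The snippet below only echoes those values; it makes no claim about the internal architecture.

model_cfg = cfg['model']

# Input node features are the sub_features chosen in the data section (8 here).
n_in_features = len(cfg['data']['sub_features'])

print(f"{n_in_features} input node features, "
      f"hidden width {model_cfg['dim_hidden']}, "
      f"n_iter={model_cfg['n_iter']}, n_out_layers={model_cfg['n_out_layers']}, "
      f"normalizing-flow convergence head: {model_cfg['global_flow']}")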

Lastly, we set more general attributes of training, such as the random seed, the device, and the number of training epochs.

# Trainer attributes
trainer:
  device_type: 'cuda'
  checkpoint_dir: results/E1
  seed: 1028
n_epochs: 200
# If you want to resume training from a checkpoint
resume_from:
  checkpoint_path: null
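
In PyTorch terms, these attributes map onto seeding, device selection, and an optional checkpoint load. The sketch below (again reusing cfg) is illustrative only; the exact resume logic is handled by the package.

import torch

trainer_cfg = cfg['trainer']
torch.manual_seed(trainer_cfg['seed'])              # reproducibility
device = torch.device(trainer_cfg['device_type'])   # 'cuda' in this example

# Resume only if a checkpoint path is given; null means train from scratch.
ckpt_path = cfg['resume_from']['checkpoint_path']
if ckpt_path is not None:
    state = torch.load(ckpt_path, map_location=device)
    # ...restore model and optimizer state from `state` before continuing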