===============
Training N2JNet
===============

To train `N2JNet`, pass in the config yml file as the argument, e.g. ::

    $ python train_n2jnet.py nersc_config.yml

The "training" section of the `nersc_config.yml` config file in the repo provides an example of how to configure training. We take a look at it here.

First, we configure the dataloader. This means setting the training healpixes of CosmoDC2, the batch size, the number of CPU cores, and the input galaxy features, as well as the galaxy selection and the photometric noise level. ::

    # Dataloader kwargs
    data:
      train_dist_name: 'norm'
      train_dist_kwargs:
        loc: 0.01
        scale: 0.04
      in_dir: '/global/cscratch1/sd/jwp/n2j/data_v04'
      train_hp: [9559, 10327, 9687, 9814, 9815, 9816,
                 9942, 9943, 10070, 10071, 10072, 10198]
      val_hp: [10199, 10200, 10450]
      n_train: [50000, 50000, 50000, 50000, 50000, 50000,
                50000, 50000, 50000, 50000, 50000, 50000]
      # Final effective training set size
      n_subsample_train: 200000
      n_val: [50000, 50000, 50000]
      # Final effective val set size
      n_subsample_val: 1000
      batch_size: 1000
      val_batch_size: 1000
      num_workers: 18
      # Global (graph-level) target; final_gamma1, final_gamma2 also available
      sub_target: ['final_kappa']
      # Local (node-level) target
      sub_target_local: ['stellar_mass', 'redshift']
      # Features available; do not modify (determined at data generation time)
      features: ['galaxy_id', 'ra', 'dec', 'redshift',
                 'ra_true', 'dec_true', 'redshift_true',
                 'bulge_to_total_ratio_i',
                 'ellipticity_1_true', 'ellipticity_2_true',
                 'ellipticity_1_bulge_true', 'ellipticity_1_disk_true',
                 'ellipticity_2_bulge_true', 'ellipticity_2_disk_true',
                 'shear1', 'shear2', 'convergence',
                 'size_bulge_true', 'size_disk_true', 'size_true',
                 'mag_u_lsst', 'mag_g_lsst', 'mag_r_lsst',
                 'mag_i_lsst', 'mag_z_lsst', 'mag_Y_lsst']
      # Features to use as input
      sub_features: ['ra_true', 'dec_true',
                     'mag_u_lsst', 'mag_g_lsst', 'mag_r_lsst',
                     'mag_i_lsst', 'mag_z_lsst', 'mag_Y_lsst']
      noise_kwargs:
        mag:
          override_kwargs: null
          depth: 5
      detection_kwargs:
        ref_features: ['mag_i_lsst']
        max_vals: [25.3]

Then we configure the optimization, i.e. the relative weighting between the loss on the global convergence labels and the loss on the local stellar mass and redshift labels, early stopping, the initial learning rate, and the learning rate decay. ::

    # Optimizer kwargs
    optimization:
      early_stop_memory: 50
      weight_local_loss: 1.0
      optim_kwargs:
        lr: 0.001
        weight_decay: 0.0001
      lr_scheduler_kwargs:
        patience: 5
        factor: 0.5
        min_lr: 0.0000001
        verbose: True

We then configure the depth and width of the network architecture. The `global_flow` key determines whether the final layer predicting the convergence is a normalizing flow. ::

    # Model kwargs
    model:
      dim_local: 50
      dim_global: 50
      dim_hidden: 50
      dim_pre_aggr: 50
      n_iter: 5
      n_out_layers: 5
      dropout: 0.04
      global_flow: False

Lastly, we set more general attributes of training, such as the random seed, the device, and the number of training epochs. ::

    # Trainer attributes
    trainer:
      device_type: 'cuda'
      checkpoint_dir: results/E1
      seed: 1028
      n_epochs: 200

    # If you want to resume training from a checkpoint
    resume_from:
      checkpoint_path: null
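The `optim_kwargs` and `lr_scheduler_kwargs` above resemble the standard arguments of PyTorch's `Adam` optimizer and `ReduceLROnPlateau` scheduler. The sketch below illustrates how kwargs of this form are typically consumed; it is an illustration under that assumption, not the actual training loop in `train_n2jnet.py`. ::

    # Hypothetical illustration: how optim_kwargs / lr_scheduler_kwargs are
    # typically wired into PyTorch. n2j's trainer may construct these objects
    # differently.
    import torch

    model = torch.nn.Linear(8, 1)  # stand-in for N2JNet

    optim_kwargs = dict(lr=0.001, weight_decay=0.0001)
    lr_scheduler_kwargs = dict(patience=5, factor=0.5, min_lr=0.0000001)

    optimizer = torch.optim.Adam(model.parameters(), **optim_kwargs)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                           **lr_scheduler_kwargs)

    # After each epoch, step the scheduler on the validation loss; the learning
    # rate is multiplied by `factor` when the loss has not improved for
    # `patience` epochs, and never drops below `min_lr`.
    val_loss = 0.5  # placeholder value
    scheduler.step(val_loss)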
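Before submitting a long job, it can also help to sanity-check the config file. The short script below is not part of the repo; it is a minimal sketch, assuming only PyYAML, that loads the config and verifies a couple of consistency requirements implied by the walkthrough above (one `n_train` entry per training healpix, and `sub_features` drawn from the available `features`). ::

    # check_config.py -- hypothetical helper, not part of the n2j repo.
    # Loads a training config and checks a few consistency assumptions
    # implied by the walkthrough above.
    import sys

    import yaml  # PyYAML


    def check_config(path):
        with open(path) as f:
            cfg = yaml.safe_load(f)
        data = cfg['data']

        # One sample-count entry is expected per healpix.
        assert len(data['n_train']) == len(data['train_hp']), \
            'n_train must have one entry per train_hp'
        assert len(data['n_val']) == len(data['val_hp']), \
            'n_val must have one entry per val_hp'

        # Input features must be a subset of the available features.
        missing = set(data['sub_features']) - set(data['features'])
        assert not missing, f'sub_features not in features: {missing}'

        print(f'{path} passed basic checks.')


    if __name__ == '__main__':
        check_config(sys.argv[1])  # e.g. python check_config.py nersc_config.yml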