This working paper describes developments in the IOTC bigeye reference set and robustness test operating models since the 2018 WPTT and WPM, together with some summary MP evaluation results. Due to time constraints, and a pervasive configuration error in the conditioning files (identified late in the process), the comparisons of fractional grids and the repeated-convergence analyses are based on flawed models; however, the generic inferences are expected to remain valid.
Key points include:
• It was not possible to produce an alternative BET growth curve that sensibly merged the western Indian Ocean tag growth increment data with the eastern Indian Ocean otolith data using the existing statistical approaches (i.e. the data are too incompatible). An alternative ad hoc growth curve was produced by combining the two growth curves, with a high weighting on the western curve for younger ages and a high weighting on the eastern curve for older ages (the blending approach is sketched below). When combined with the higher CL sample size assumption and low M, the ad hoc growth curve was associated with implausible population dynamics (a poor fit to the early CPUE combined with dubiously high initial depletion). This suggests that growth uncertainty may well be important, but we omitted this scenario from further investigation because it is not a defensible scenario.
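To make the blending idea concrete, the following is a minimal sketch assuming von Bertalanffy forms and a logistic age-based weighting; all parameter values (growth parameters, inflection age, slope) are hypothetical placeholders, not the values used in the conditioning:

```python
import numpy as np

def vb_length(age, linf, k, t0):
    """Von Bertalanffy length-at-age."""
    return linf * (1.0 - np.exp(-k * (age - t0)))

def blended_length(age, a50=3.0, slope=1.0):
    """Ad hoc blend: the weight shifts smoothly from the western
    (tag increment) curve at young ages to the eastern (otolith)
    curve at older ages via a logistic ramp in age."""
    w_east = 1.0 / (1.0 + np.exp(-slope * (age - a50)))  # weight on eastern curve
    west = vb_length(age, linf=170.0, k=0.30, t0=-0.5)   # placeholder western parameters
    east = vb_length(age, linf=180.0, k=0.20, t0=-0.8)   # placeholder eastern parameters
    return (1.0 - w_east) * west + w_east * east

ages = np.arange(0, 16)
print(np.round(blended_length(ages), 1))
```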
• Additional attention was given to the issue of numerical stability and model convergence in 2019. Instead of simply rejecting models that failed to meet the gradient-based convergence criterion, the minimization was automatically repeated from jittered initial parameter values until convergence (maximum absolute gradient < 0.01) was achieved or 10 minimizations had failed (the restart logic is sketched below). All configurations were able to meet this criterion for BET (though this was not the case for YFT).
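A minimal sketch of the jitter-and-restart logic, assuming a generic `minimize` wrapper (hypothetical; the actual implementation wraps SS3 runs) that returns the objective value, the maximum absolute gradient, and the fitted parameters:

```python
import numpy as np

MAX_ABS_GRAD = 0.01   # convergence criterion from the text
MAX_FAILURES = 10     # stop after this many failed minimizations

def fit_with_restarts(minimize, init_params, jitter_sd=0.1, seed=0):
    """Repeat the minimization from jittered starting values until the
    gradient criterion is met or the failure limit is reached.
    `minimize` is a hypothetical callable: params -> (obj, max_grad, fitted)."""
    rng = np.random.default_rng(seed)
    start = np.asarray(init_params, dtype=float)
    params = start.copy()
    for attempt in range(MAX_FAILURES):
        obj, max_grad, fitted = minimize(params)
        if max_grad < MAX_ABS_GRAD:
            return obj, fitted                       # converged
        # jitter the original starting values for the next attempt
        params = start * (1.0 + rng.normal(0.0, jitter_sd, start.size))
    raise RuntimeError(f"no convergence after {MAX_FAILURES} attempts")
```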
• The automated minimization was also used to replicate convergence (3 times per configuration) to examine the sensitivity of the minimization to initial conditions. Within a model configuration, the standard deviation of the final objective function was ~20 likelihood units (with several values of 100-1000+). However, the CV of the stock status characteristics (MSY and B/BMSY) within a configuration was an order of magnitude smaller than the stock status variability among models (based on the lowest likelihood attained within each configuration); the comparison is sketched below. There was no evidence that a better likelihood was associated with a lower gradient among the models that reached the gradient threshold. Only the lowest objective function model was retained for the OMs.
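The within- versus among-configuration comparison can be summarised as follows; a sketch assuming a hypothetical `results` structure mapping each configuration to its replicate fits:

```python
import numpy as np

def status_variability(results):
    """results: dict mapping configuration id -> list of (objective, msy)
    tuples from the replicated minimizations (3 per configuration)."""
    within_cvs, best_msy = [], []
    for fits in results.values():
        objs = np.array([obj for obj, _ in fits])
        msys = np.array([msy for _, msy in fits])
        within_cvs.append(msys.std() / msys.mean())  # CV of MSY within a configuration
        best_msy.append(msys[objs.argmin()])         # keep the lowest-objective replicate
    best_msy = np.array(best_msy)
    among_cv = best_msy.std() / best_msy.mean()      # CV of MSY among configurations
    return float(np.mean(within_cvs)), float(among_cv)
```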
• A 288-model ensemble was initially intended as the reference case OM for this meeting (but was subsequently found to contain an error in the application of the regional scaling factors and CPUE weightings). Comparing the full factorial grid (288 models) with fractional grids (see the sketch after this list), we note that:
o The fractional 144-model grid (which would allow all main effects and two-way interactions to be estimable) appears essentially identical in terms of stock status estimates.
o The fractional 72-model grid (with only main effects estimable) resembles the full grid in terms of stock status estimates, except that the distribution is somewhat multimodal.
o The MP evaluation graphics are virtually indistinguishable among the three grids, i.e. the MP selection process would likely have reached the same conclusions regardless of how many models were in the OM grid.
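To illustrate how such fractions can be constructed, the sketch below assumes a 2^5 × 3^2 factor structure (five 2-level and two 3-level factors, consistent with the 288-cell full grid); the factor names, level codes, and design generators are hypothetical, not those of the actual grid:

```python
from itertools import product

# Hypothetical structure: five 2-level factors (A-E) and two 3-level
# factors (F, G) give 2**5 * 3**2 = 288 cells in the full grid.
levels = [(0, 1)] * 5 + [(0, 1, 2)] * 2
full = [dict(zip("ABCDEFG", combo)) for combo in product(*levels)]

# Half fraction (144 models): keep cells with an even sum over the
# 2-level factors (defining relation I = ABCDE, resolution V), so all
# main effects and two-way interactions remain estimable.
half = [c for c in full if (c["A"] + c["B"] + c["C"] + c["D"] + c["E"]) % 2 == 0]

# Quarter fraction (72 models): generators D = A xor B and E = A xor C
# (resolution III), leaving only main effects estimable.
quarter = [c for c in full
           if c["D"] == (c["A"] ^ c["B"]) and c["E"] == (c["A"] ^ c["C"])]

print(len(full), len(half), len(quarter))  # 288 144 72
```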
Conditioning and MP results are presented for OMgridB19.5, a 7-factor fractional factorial grid of 144 models. We propose retaining this grid for the TCMP 2019 reference case OM, subject to feedback from the MSE Task Force with respect to:
o factors and levels to include
o fractional factorial design (main effects + 2-way interactions proposed)
o model plausibility filtering (notably with respect to the SS3 catch penalty)
• The reference case MP evaluation performance appears very similar to that of the previous iteration, and the tuning objectives set by the 2018 TCMP appear to cover a reasonable range of sensible behaviour.
• A number of robustness scenarios are presented; these degrade MP performance in a qualitatively predictable manner. It is not clear that all of these scenarios are plausible, or that the results are helpful for the purposes of MP selection, and we propose not to present any of them to the 2019 TCMP.
Summary points are presented for discussion and/or endorsement by the IOTC MSE Task Force.
The issue of evaluating the plausibility of models within a large grid remains unresolved. We speculate that the current diagnostics and ad hoc inspection process should identify gross outlier behaviour in the system features that are likely to be relevant for MP evaluation; however, we expect that undesirable characteristics might be evident in some models if they were explored in detail.