We evaluate models using six metrics: AUC, shuffled AUC, Normalized Scanpath Saliency (NSS), Correlation Coefficient (CC), Similarity (SIM), and KL-Divergence.
The evaluations are implemented in the pysaliency Python library and invoked with the code available here.
For probabilistic models, we first derive the correct saliency maps for the evaluated metrics and then evaluate the metrics on those maps. More precisely, each metric is evaluated on the saliency map that the model itself predicts to have the highest expected performance under that metric.
This results in models being scored fairly in all metrics, and they can therefore reach higher scores in some metrics than classic saliency map models.
For more details, check Kümmerer et al., Saliency Benchmarking Made Easy: Separating Models, Maps and Metrics [ECCV 2018].
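As a minimal sketch of this idea (not the exact derivations from the paper), a predicted fixation density can be transformed into metric-specific saliency maps. The function name, the dispatch structure, and the `centerbias_density` argument below are illustrative assumptions:

```python
def saliency_map_for_metric(density, metric, centerbias_density=None):
    """Illustrative sketch only; Kümmerer et al. (ECCV 2018) derive the
    truly optimal saliency map for each metric."""
    if metric in ('AUC', 'NSS'):
        # The predicted density itself (for AUC, any monotone transform
        # of it) maximizes the expected score.
        return density
    if metric == 'sAUC':
        # For shuffled AUC, the density is corrected for the distribution
        # that generates the nonfixations (roughly, the center bias).
        return density / centerbias_density
    raise NotImplementedError(f'no sketch for metric {metric!r}')
```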
Note that when reimplementing the evaluation, we tried to make some details more principled. Therefore, there are a few inconsistencies between the original evaluation code of the MIT Saliency Benchmark and how the saliency maps are computed and evaluated here:
In the original MIT Saliency Benchmark, multiple fixations at the same image location are treated as a single fixation at that location. This happens quite often (380 times in MIT1003, and more often on MIT300 due to the higher number of subjects per image) and gives rise to artefacts, especially in the case of many fixations. Here, each fixation contributes equally to the performance on a specific image.
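For illustration, a minimal NSS sketch where repeated locations count as separate fixations; the coordinate arrays `xs` and `ys` are assumed to hold integer pixel positions:

```python
import numpy as np

def nss(saliency_map, xs, ys):
    # Normalize the map to zero mean and unit standard deviation.
    s = (saliency_map - saliency_map.mean()) / saliency_map.std()
    # Indexing with the full coordinate arrays keeps duplicate
    # locations as individual fixations instead of collapsing them.
    return s[ys, xs].mean()

saliency_map = np.random.rand(768, 1024)
xs = np.array([512, 512, 100])  # the fixation at (512, 400) occurs twice
ys = np.array([400, 400, 700])  # and contributes twice to the score
print(nss(saliency_map, xs, ys))
```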
The original MIT Saliency Benchmark uses 8 cycles/image for computing Gaussian convolutions and does so via the Fourier domain, i.e., by zero-padding the image to be square and then extending it cyclically. According to the paper, 8 cycles/image corresponds to 1 dva, or about 35 pixels. We therefore use a Gaussian convolution with 35 pixels and nearest padding (which shouldn't make much of a difference given how sparse the fixations are).
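A sketch of how such an empirical density might be computed with scipy; treating the 35 pixels as the filter's sigma is an assumption of this snippet:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Toy fixation histogram: one count per fixation, duplicates add up.
fixation_map = np.zeros((768, 1024))
fixation_map[400, 512] += 1
fixation_map[700, 100] += 1

# Nearest padding instead of the zero-padding plus cyclic extension
# of the original Fourier-domain implementation.
empirical_density = gaussian_filter(fixation_map, sigma=35, mode='nearest')
empirical_density /= empirical_density.sum()
```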
We don't have correct saliency maps for the Earth Mover's Distance yet: for this metric, there is no analytic solution to the optimization problem of which saliency map has the highest expected metric performance under the predicted fixation distribution.
While the AUC_Judd metric of the original MIT Saliency Benchmark uses all pixels that are not fixated as nonfixations, we use all pixels as nonfixations. We consider this more principled since it behaves better in the limit of many fixations.
Originally, the AUC_Judd metric added some random noise to the saliency map to make sure no two pixels had the same saliency value, since the AUC implementation could not handle ties. Our implementation can handle this case (including a constant saliency map, which results in an AUC score of 0.5), so the noise is no longer needed.
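A tie-aware rank-based AUC along these lines (a sketch, not the pysaliency implementation) covers both points: all pixels serve as nonfixations, and average ranks make added noise unnecessary:

```python
import numpy as np
from scipy.stats import rankdata

def auc_all_pixels(saliency_map, xs, ys):
    positives = saliency_map[ys, xs]   # saliency at fixated pixels
    negatives = saliency_map.ravel()   # *all* pixels as nonfixations
    ranks = rankdata(np.concatenate([positives, negatives]))
    n_pos, n_neg = len(positives), len(negatives)
    # Mann-Whitney formulation; average ranks resolve ties, so a
    # constant saliency map yields exactly 0.5.
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```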
In the MIT Saliency Benchmark, the shuffled AUC metric took the fixations of 10 other images, removed duplicate fixation locations among them, 100 times chose a random subset of them as large as the number of actual fixations, computed the AUC score between the fixations and those nonfixations, and finally averaged the scores. Instead, we simply take all fixation locations of all other images as nonfixations. As for the normal AUC, fixations and nonfixations can have repeated locations, which is even more important here than for the normal AUC due to the higher fixation density in the image center.
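The shuffled AUC then differs only in where the nonfixations come from; again a sketch with illustrative argument names:

```python
import numpy as np
from scipy.stats import rankdata

def rank_auc(positives, negatives):
    # Same tie-aware Mann-Whitney formulation as in the AUC sketch above.
    ranks = rankdata(np.concatenate([positives, negatives]))
    n_pos, n_neg = len(positives), len(negatives)
    return (ranks[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def shuffled_auc(saliency_map, xs, ys, other_xs, other_ys):
    # Nonfixations: all fixation locations from all other images,
    # duplicates included -- no deduplication, no random subsampling.
    return rank_auc(saliency_map[ys, xs], saliency_map[other_ys, other_xs])
```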
So far, the MIT Saliency Benchmark accepted saliency maps of arbitrary size and rescaled them to the size of the input images. However, the rescaling operation has many parameters that can be chosen (nearest-neighbour upsampling, cubic, etc.) and that might affect metric performance, especially for metrics that are very sensitive in some scale regions (e.g., KL-Div). This choice should be made by the submitting researcher. Therefore, new submissions must provide saliency maps of the same size as the evaluated images.
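A submission-side check of this requirement might look as follows (the function and variable names are illustrative):

```python
def check_saliency_map_shape(saliency_map, stimulus):
    # No rescaling is done by the benchmark anymore: the map must
    # already match the stimulus resolution.
    if saliency_map.shape != stimulus.shape[:2]:
        raise ValueError(
            f'saliency map shape {saliency_map.shape} does not match '
            f'stimulus shape {stimulus.shape[:2]}'
        )
```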
In the MIT Saliency Benchmark, the similarity metric rescaled each saliency map to range from 0 to 1 before dividing by its sum to make it a distribution. Since there is no reason why at least one pixel should have a probability of zero, this is no longer done. Instead, the original saliency map is directly divided by its sum.
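As a sketch of the changed normalization (assuming an empirical fixation density as the second argument):

```python
import numpy as np

def similarity(saliency_map, empirical_density):
    # Divide by the sum directly; no prior rescaling to [0, 1], so no
    # pixel is forced to have probability zero.
    p = saliency_map / saliency_map.sum()
    q = empirical_density / empirical_density.sum()
    # SIM is the histogram intersection of the two distributions.
    return np.minimum(p, q).sum()
```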