Kaggle Competition Report: SETI Breakthrough Listen

August 28, 2021 | 5 min read | 137 views

On August 18, the Kaggle SETI competition that I was participating in came to an end. This competition was hosted by Berkeley SETI Research Center, and participants competed on how accurately they can identify anomalous signals from the tremendous amount of data recorded by the Green Bank Telescope. Since we haven’t observed any signals from extraterrestrials (officially), the dataset was artificially generated by the host.

The Green Bank Telescope, West Virginia, US.

768 teams joined in the end, and I came in 198th (26%).

In this post, I briefly introduce the task for non-participants, and summarize the top 5 teams’ solutions and mine. Finally, I explain why “old leaky data” is included in the dataset and what it is.

Task

The task was to predict whether the spectrograms contained signals from E.T or not. The evaluation metric was AUC. There were 60,000 training samples and 40,000 test samples, 10% of which were positive. Each sample is a set of 6 spectrograms, or a “cadence”. They are continuous in time, but the first, third, and fifth are referred to as “on target” signals that come from the target star, while the rest are “off target” signals from the nearby stars. If there is a pattern in “on target” and no pattern in “off target”, then the signal is a candidate signature of extraterrestrial technology (a positive sample)! If the pattern is also seen in “off target”, it should be a human-generated artifact (a negative sample). Here is an example of positive cadence.

Does it look easy? Unfortunately, real samples are much harder to distinguish. It’s literally like a “needle in a haystack”.

For more information, see Data Information section.

Final Standings

The final standings were quite interesting. The top team “Watercooled” is more than 0.15 ahead of the other teams in AUC. What did they do??

Solutions

Here is a summary of the top five teams’ solutions and mine. They are listed in order of score for consistency, but considering the uniqueness of the 1st place solution, you might want to read the other solutions first and then the 1st place.

1st Place

The team “Watercooled” (@philippsinger, @christofhenkel, and @ilu000) solved the task almost perfectly. The super team examined the CV/LB gap carefully and came up with two “magics” that led them to the victory.

The magic #1: S-shaped signal in the test set

Some models were significantly better than others in LB even though they performed similarly in CV
Good models were able to detect S-shaped signals that only appear in the test set

Injected this kind of signals to the train set using setigen

The magic #2: Removing the background noise

There are common background patterns as discussed here
By retrieving similar images and subtracting the pixel intensity, the background noise was successfully removed

Training details

Did NOT use blending or pseudo labeling
Concatenated 3-channel cadences to 1 channel WITHOUT resizing
NFNet (eca_nfnet_l2) performed better probably because they are sensitive to the distributional differences thanks to its normalizer-free architecture
Trained the single-fold model for 20 epochs
Used the new train set and the old train and test set, both cleaned and non-cleaned
Enlarged the input images by changing the stride of the first convolution layer to (1,2)
Data augmentation (mixup and vertical flip)
Test-time augmentation

2nd Place

@hirune924 won second place with a surprisingly simple solution. It seems to me that every single component was essential for the high score.

1st Stage

Concatenated 6-channel cadences to 1 channel and resize them to 512x512
Trained EfficientNet B5 for 20 epochs (bigger models worked better)
Data augmentation (mixup, vertical flip, and cutout)
Test-time augmentation

2nd Stage

Further trained the model with the pseudo-labeled test set
Stronger regularization (stochastic depth and dropout) were applied to avoid overfitting

3rd Place

@knjcode explored the model architecture.

Concatenated 3-channel cadences to 1 channel and resize them to 768x768
Adopted EfficientNet B1, B2, B3, and B4 as the backbone
Appended a triplet attention module and a Generalized-Mean (GeM) pooling layer to the backbone
Trained the model for 60 epochs
Further trained the model for 10 epochs with the 5,000 pseudo-labeled test samples
Data augmentation (mixup, horizontal/vertical flip, etc)
Test-time augmentation

4th Place

The team “Steven Signal” (@sherlockkay, @kulyabin, and @bakeryproducts) took almost every possible approach, among which mixed sample data augmentations and focal loss were the keys. The team also tried injecting S-shaped signals and cleaning the background noise but couldn’t improve AUC.

Concatenated 6-channel cadences to 1 channel and resize them to 960x768
Trained nf-regnetb1 and HRNet18 with focal loss
Mixed sample data augmentations, such as mixup and cutmix, are effective

5th Place

The team “SETIの壁” (@sinpcw, @kzkt0713, and @sunakuzira) tried many things to tackle the domain shift and reported that SHOT was useful.

Solution

Concatenated 3-channel cadences to 1 channel and resize them to 512x512
Trained NFNet (eca_nfnet_l2) and EfficientNetV2-medium
Further trained the models with the pseudo-labeled test samples (all or only reliable ones)
Unsupervised domain adaptation with SHOT
Data augmentation (mixup, horizontal/vertical flip, etc)
Test-time augmentation

What did not work

Using off-target channels
Focal loss and weighted cross entropy loss
Pretraining with the old leaky data

My Solution (198th Place)

I joined this competition in the early phase and did my best on it. I published a notebook titled “Search for Effective Data Augmentation”, and I got to the bronze zone for the first time in my life. But since the leakage was reported, I had been focusing on the CommonLit Readability Prize competition. I was back one week before the deadline and found the game to be much more difficult than before the reset.

My solution was heavily based on the clean and powerful notebook by @tuckerarrants. It resizes the input images with custom convolution layers before feeding them to the backbone (e.g., EfficientNet) to avoid information loss during the non-parametric interpolation. By replacing EfficientNet B0 with B3 and averaging over 5-fold models, the score increased from 0.745 to 0.768 (I should have tried even bigger models…).

I also tried adversarial validation, but it resulted in poor scores. Given that the AUC of the train/test classification model was 0.999, the two domains overlap too little for adversarial validation to work.

About Old Leaky Dataset

As I mentioned earlier, two kinds of leakage were discovered in the course of the competition, which caused the entire dataset to be reset.

In the early stage, the number of digits displayed on the leaderboard was only to the second decimal place, and we thought AUC=0.99 could be possible, but when it started to display to the third decimal place and the top team’s score was revealed to be 0.999, everyone was convinced that it was leakage. And in fact, there were two kinds of leakage.

What was the Leakage?

First, @kazanova reported that it was possible to identify the target variable from the timestamp and ID of the numpy file.

Second, @nyanpn reported that positives and negatives are easily distinguished just by some simple statistics such as standard deviation.

Fortunately, this competition used simulated data, so the host regenerated a leak-free dataset. The competition was relaunched with an extended deadline.