Unveiling GELATO: A Novel Approach to Sparse Link Prediction
By João Mattos, Zexi Huang, Mert Kosan, Ambuj Singh, and Arlei Silva
KDD 2025 Oral Presentation Highlight
Link prediction, the task of identifying missing or potential connections in graphs, is a cornerstone of modern machine learning with applications spanning recommendation systems, protein-protein interaction networks, and social media analysis. However, the real world often presents a significant challenge: graphs are sparse. In these scenarios, the vast majority of possible connections do not exist, leading to extreme class imbalance and rendering traditional link prediction methods ineffective. Our recent work, presented at KDD 2025, introduces GELATO, a novel framework designed to overcome these limitations and deliver superior performance in sparse link prediction.
The Challenge of Sparse Link Prediction
Traditional link prediction methods often rely on a biased sampling approach (first image above), where equal numbers of existing (positive) and non-existent (negative) links are used for training. This approach, however, fails to capture the reality of most real-world graphs, which are incredibly sparse. In a sparse graph, the number of non-existent links vastly outweighs the number of existing ones (see second image above). This imbalance poses a significant challenge for machine learning models, which can become overwhelmed by the majority class (negative links) and fail to learn the subtle patterns that define the minority class (positive links).
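To make this imbalance concrete, here is a quick back-of-the-envelope calculation (the node and edge counts below are illustrative, not taken from any benchmark in the paper):

```python
# Illustrative sparsity check for an undirected graph.
# The counts are hypothetical, chosen only to show the scale of the imbalance.
num_nodes = 10_000
num_edges = 50_000                                  # observed (positive) links

possible_pairs = num_nodes * (num_nodes - 1) // 2   # all candidate links
num_negatives = possible_pairs - num_edges          # non-existent links

print(f"positive links: {num_edges:,}")
print(f"negative links: {num_negatives:,}")
print(f"negatives per positive: {num_negatives / num_edges:,.0f}")
# Roughly 1,000 negatives for every positive: balanced sampling discards
# almost all of the majority class the model will face at test time.
```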
Our research has identified two key limitations of state-of-the-art link prediction methods when confronted with sparse graphs:
- The “Hard Negative” Problem: Existing methods struggle to distinguish positive links from “hard negatives”: non-existent links that share many characteristics with positive links, such as connecting nodes within the same community of a social network. The inability to differentiate these hard negatives from true positives is a major bottleneck for accurate link prediction.
- Binary Classification vs. Ranking: Many current methods treat link prediction as a binary classification problem (a link either exists or it doesn’t). This framing misses how link prediction is used in practice, where the goal is usually to rank potential links by their likelihood (see the evaluation sketch below). By treating all negative links as equal, these methods discard information that could improve their ranking performance.
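The gap between the two views is easy to see with a toy evaluation. The snippet below computes a ranking metric (precision@k) against the full pool of negatives, including the hard ones; the `precision_at_k` helper and the synthetic score distributions are our own illustrative construction, not the evaluation protocol from the paper:

```python
import numpy as np

def precision_at_k(scores_pos, scores_neg, k=100):
    """Ranking view of link prediction: of the k highest-scoring candidate
    pairs, what fraction are true (positive) links?"""
    scores = np.concatenate([scores_pos, scores_neg])
    labels = np.concatenate([np.ones_like(scores_pos), np.zeros_like(scores_neg)])
    top_k = np.argsort(-scores)[:k]
    return labels[top_k].mean()

# Toy model: it separates positives from *easy* negatives, but scores
# hard negatives (e.g., pairs inside the same community) just as high.
rng = np.random.default_rng(0)
pos      = rng.normal(2.0, 0.5, size=200)        # positive links
hard_neg = rng.normal(2.0, 0.5, size=2_000)      # hard negatives
easy_neg = rng.normal(0.0, 0.5, size=200_000)    # easy negatives

all_neg = np.concatenate([hard_neg, easy_neg])
print("precision@100:", precision_at_k(pos, all_neg))
# Binary accuracy on a balanced sample looks excellent here, because easy
# negatives dominate the sample; the ranking metric exposes the confusion
# between positives and hard negatives.
```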
Introducing GELATO: A New Recipe for Link Prediction
To address these challenges, we developed GELATO, a novel framework that reframes link prediction as a ranking problem and incorporates several key innovations (a simplified code sketch of how they fit together follows the list):
- Unbiased Negative Sampling: GELATO starts by correcting the sampling bias inherent in traditional methods. By sampling negative links in proportion to their true prevalence in the graph, we train our model on a realistic distribution of data, preparing it for the challenges of real-world sparse graphs.
- Attribute-Enhanced Graph Learning: GELATO leverages node features to build an enhanced representation of the graph. By passing node features through a multi-layer perceptron (MLP), we generate learned edge weights that capture the similarity between nodes. These learned weights are then combined with the original graph structure to form an enhanced adjacency matrix, providing a richer source of information for link prediction.
- Autocovariance for Hard Negative Distinction: At the heart of GELATO is a topological heuristic called autocovariance. It captures the community structure of the graph and, crucially, allows GELATO to distinguish hard negatives from positive links, a task that existing methods often struggle with.
- N-pair Loss for Improved Ranking: Finally, GELATO replaces the traditional binary cross-entropy loss with the N-pair loss. This change is fundamental to treating link prediction as a ranking problem: the N-pair loss encourages the model to assign a higher score to each positive link than to its corresponding negatives, leading to a more accurate and robust ranking of potential links.
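The sketch below shows one way these four ingredients could fit together. It is a simplified, dense-matrix illustration under our own assumptions: the class name `GelatoSketch`, the layer sizes, the mixing weight `eta`, and the walk length `t` are hypothetical choices, and the autocovariance is written in a common random-walk form rather than as the paper's exact implementation:

```python
import torch
import torch.nn as nn

class GelatoSketch(nn.Module):
    """Simplified sketch of the GELATO recipe: MLP-learned edge weights,
    an attribute-enhanced adjacency, autocovariance scores, and an
    N-pair ranking loss. All sizes and constants are illustrative."""

    def __init__(self, num_features, hidden_dim=64):
        super().__init__()
        # MLP over concatenated endpoint features -> learned edge weight.
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * num_features, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def enhanced_adjacency(self, adj, x, edge_index, eta=0.5):
        # Learned similarity weights for a sparse set of candidate edges.
        src, dst = edge_index
        w = torch.sigmoid(
            self.edge_mlp(torch.cat([x[src], x[dst]], dim=-1))
        ).squeeze(-1)
        learned = torch.zeros_like(adj)
        learned[src, dst] = w
        learned[dst, src] = w
        # Blend the original structure with the learned attribute similarity.
        return adj + eta * learned

    @staticmethod
    def autocovariance(adj, t=3):
        # One common form of the length-t random-walk autocovariance:
        #   R_t = diag(pi) M^t - pi pi^T,  with M = D^{-1} A and pi = d / vol(G).
        deg = adj.sum(dim=1)
        pi = deg / deg.sum()
        M = adj / deg.clamp(min=1e-12).unsqueeze(1)
        return torch.diag(pi) @ torch.matrix_power(M, t) - torch.outer(pi, pi)

    def scores(self, adj, x, edge_index, pairs, t=3):
        # `pairs` is a (2, P) tensor of candidate (source, target) node pairs.
        A_enh = self.enhanced_adjacency(adj, x, edge_index)
        R = self.autocovariance(A_enh, t)
        return R[pairs[0], pairs[1]]


def n_pair_loss(pos_scores, neg_scores):
    """N-pair ranking loss. pos_scores: (B,); neg_scores: (B, N), i.e.
    N negatives paired with each positive. Each positive is pushed above
    its own set of negatives rather than classified in isolation."""
    diff = neg_scores - pos_scores.unsqueeze(1)
    return torch.log1p(torch.exp(diff).sum(dim=1)).mean()
```

Because the autocovariance is computed from the enhanced adjacency, gradients from the N-pair loss flow back into the edge MLP, so the learned edge weights are optimized directly for ranking rather than for binary classification.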
Scaling GELATO for the Real World
While GELATO’s approach is powerful, applying it to massive real-world graphs requires careful consideration of scalability. To address this, we have developed several techniques to make GELATO efficient and practical for large-scale applications:
- Focusing on Hard Negatives: For large graphs, we can significantly reduce the computational cost of unbiased sampling by concentrating on the most informative negative samples, the hard negatives. This maintains high performance while dramatically reducing training time.
- Sparse Matrix Operations: To overcome the memory cost of autocovariance, we rely on sparse matrix operations, working with a smaller, more manageable representation of the autocovariance matrix (see the sketch after this list). This makes it feasible to apply GELATO to graphs with millions of nodes.
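As a concrete illustration of the second point, here is one way to score only a given set of candidate pairs without materializing the dense autocovariance matrix, using SciPy's sparse operations. The function name and the row-wise propagation strategy are our own simplification, not the paper's exact implementation:

```python
import numpy as np
import scipy.sparse as sp

def autocovariance_for_pairs(adj: sp.csr_matrix, pairs: np.ndarray, t: int = 3):
    """Score candidate pairs (u, v) with the random-walk autocovariance
    R_t[u, v] = pi[u] * (M^t)[u, v] - pi[u] * pi[v], keeping everything
    sparse and only propagating the rows that are actually needed."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    pi = deg / deg.sum()
    # Row-stochastic transition matrix M = D^{-1} A, kept sparse.
    M = sp.diags(1.0 / np.maximum(deg, 1e-12)) @ adj

    # One indicator row per distinct source node appearing in `pairs`.
    rows = np.unique(pairs[:, 0])
    walk = sp.csr_matrix(
        (np.ones(len(rows)), (np.arange(len(rows)), rows)),
        shape=(len(rows), adj.shape[0]),
    )
    for _ in range(t):   # e_u^T M^t, computed row by row in sparse form
        walk = walk @ M

    row_of = {u: i for i, u in enumerate(rows)}
    return np.array([
        pi[u] * walk[row_of[u], v] - pi[u] * pi[v]
        for u, v in pairs
    ])
```

Restricting the candidate pairs themselves to likely hard negatives (for example, node pairs within a small number of hops of each other) complements this, cutting the number of pairs that need to be scored at all.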
GELATO in Action: Superior Performance and Key Insights
Our experiments demonstrate that GELATO consistently outperforms state-of-the-art link prediction methods, particularly in sparse graph settings. The results not only show that GELATO excels at distinguishing hard negatives from positive links, but also highlight the surprising effectiveness of similarity-based learners and classic heuristics, which can even outperform more complex GNN-based methods in certain scenarios.
Our work with GELATO has yielded several key insights:
- Similarity is Key: Treating link prediction as a similarity task, as GELATO does, is more effective than traditional classification-based approaches.
- Simplicity Can Win: Simple, lightweight, and similarity-based methods can outperform complex GNNs, especially in the face of extreme class imbalance.
- Local Structure Matters: Exploring the local structure of the graph is crucial for accurate sparse link prediction.
- Hard Negatives are Informative: The ability to identify and leverage hard negative samples is a key differentiator for high-performing link prediction models.
In conclusion, GELATO offers a robust, scalable, and highly effective solution for sparse link prediction. By rethinking the fundamental assumptions of traditional link prediction methods and introducing a novel set of techniques, GELATO paves the way for more accurate and reliable link prediction in a wide range of real-world applications.
We invite you to read our full paper for a more detailed discussion of GELATO and our experimental results.