Link Prediction on Distributed Environment: A Comparison between Support Vector Machines and Neural Networks

Vansh Gupta
3 min read · Apr 25, 2023


Networked websites are a ubiquitous part of everyday life, from social media to travel planning. Analysing the networks behind them and predicting future interactions can help us understand their behaviour and support decision-making. Machine learning techniques such as Support Vector Machines (SVMs) and Neural Networks (NNs) are a prominent approach to network analysis. As data sizes grow, however, analysing massive networks scalably and efficiently requires a distributed computing platform such as Apache Spark. This blog article compares the effectiveness of SVMs and NNs for link prediction on graphs stored in a distributed environment.

What is Link Prediction?

Link prediction is a technique for predicting whether an edge will exist between two nodes in a network. It can be applied to many different tasks, such as fraud detection, social network analysis, and recommendation systems. The premise behind link prediction is that nodes with similar structural characteristics are more likely to be linked.

Support Vector Machines (SVMs)

An SVM is a supervised learning method used for classification and regression. SVMs separate the data into classes by constructing a hyperplane. They have been used effectively for link prediction in a variety of applications, such as social network analysis and bioinformatics. In link prediction, an SVM learns a function that maps the structural properties of a pair of nodes to the likelihood of an edge connecting them.
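To make the hyperplane idea concrete, here is a minimal sketch of how a linear SVM scores a candidate node pair once it has been trained. The weights, bias and the two example features are hypothetical values chosen for illustration; they are not taken from the study.

```python
def svm_decision(weights, bias, features):
    """Signed score of a feature vector relative to the hyperplane w.x + b = 0."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def svm_predict(weights, bias, features):
    """Return 1 (link) if the point lies on the positive side, else 0 (no link)."""
    return 1 if svm_decision(weights, bias, features) >= 0 else 0


# Hypothetical, hand-picked parameters for two structural features of a node pair.
weights = [0.5, 2.0]
bias = -1.5

strong_pair = [3, 0.4]  # many shared neighbours
weak_pair = [0, 0.0]    # no shared neighbours

print(svm_predict(weights, bias, strong_pair))
print(svm_predict(weights, bias, weak_pair))
```

In a real model the weights and bias are learned from labelled node pairs rather than set by hand.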

Neural Networks (NNs)

Neural networks, or NNs, are machine learning models inspired by the structure and operation of the human brain. An NN is a collection of interconnected nodes, where each node applies a mathematical operation to its inputs and produces an output. NNs have proven effective for link prediction in a number of applications, such as recommender systems and social network analysis, where they learn a function that maps the structural properties of a pair of nodes to the likelihood of an edge connecting them.
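As a rough illustration of nodes that process their inputs and pass outputs on, the sketch below runs a forward pass through a tiny network with one hidden layer. The layer sizes, the ReLU and sigmoid activations, and all parameter values are assumptions made for the example, not details of the model used in the study.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One layer: every node takes a weighted sum of its inputs, then applies an activation."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def link_probability(features, params):
    """Map a node pair's feature vector to a value between 0 and 1."""
    hidden = dense(features, params["W1"], params["b1"], relu)
    return dense(hidden, params["W2"], params["b2"], sigmoid)[0]


# Hypothetical parameters: 2 input features, 2 hidden nodes, 1 output node.
params = {
    "W1": [[0.5, 1.0], [-0.3, 0.8]],
    "b1": [0.0, 0.1],
    "W2": [[1.2, -0.7]],
    "b2": [-0.5],
}

p = link_probability([3.0, 0.4], params)
print(round(p, 3))
```

Training would adjust the entries of `params` so that the output is high for node pairs that are linked and low for pairs that are not.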

Distributed Computing Environment

Distributed computing environments such as Apache Spark make it possible to analyse massive volumes of data in a distributed, parallel fashion. Spark processes data on a cluster of machines: each worker processes a small subset of the input, and the partial results are combined into a single output. This makes it possible to process datasets that would be too large to handle on a single machine.
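The split-then-combine pattern can be imitated on one machine in plain Python. The sketch below is not Spark's API; it only mimics the idea by splitting a small, made-up edge list into chunks, counting node degrees per chunk, and merging the partial counts.

```python
from collections import Counter
from functools import reduce


def partition(items, n_parts):
    """Split a list into n_parts roughly equal chunks (stand-ins for cluster workers)."""
    return [items[i::n_parts] for i in range(n_parts)]


def degrees_in_chunk(edges):
    """Work done independently on one chunk: count edge endpoints."""
    counts = Counter()
    for u, v in edges:
        counts[u] += 1
        counts[v] += 1
    return counts


# Made-up edge list used only for illustration.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

partials = [degrees_in_chunk(chunk) for chunk in partition(edges, 3)]
degrees = reduce(lambda left, right: left + right, partials, Counter())

print(dict(degrees))
```

On a real cluster the per-chunk step runs on different machines at the same time, which is where the speed-up comes from.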

Link Prediction on Distributed Environment

To evaluate the performance of SVMs and NNs for link prediction in a distributed setting, we ran an experiment on a graph dataset stored and processed with Apache Spark. The dataset is a social network with 1.5 million nodes and 6.4 million edges. To predict links, we used the following structural features, each computed for a pair of nodes:

1. Common Neighbours: the number of neighbours that the two nodes share.

2. Jaccard coefficient: the number of shared neighbours divided by the total number of distinct neighbours of the two nodes.

3. Adamic-Adar index: like Common Neighbours, but each shared neighbour is weighted by the inverse logarithm of its degree, so rarer (low-degree) neighbours count for more.

4. Preferential attachment: the product of the two nodes' degrees.
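The four features can be sketched in a few lines of single-machine Python. This is a minimal illustration on a made-up toy graph, not the Spark pipeline used in the experiment; the function names are hypothetical, and preferential attachment is computed with its standard definition, the product of the two degrees.

```python
import math


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def pair_features(adj, u, v):
    """Structural features for the candidate link (u, v)."""
    nu, nv = adj.get(u, set()), adj.get(v, set())
    shared = nu & nv
    union = nu | nv
    return {
        "common_neighbours": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        # Each shared neighbour z contributes 1 / log(degree(z)).
        "adamic_adar": sum(
            1.0 / math.log(len(adj[z])) for z in shared if len(adj[z]) > 1
        ),
        "preferential_attachment": len(nu) * len(nv),
    }


# Toy graph used only for illustration.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")]
adj = build_adjacency(edges)
features = pair_features(adj, "a", "d")

print(features)
```

For the pair ("a", "d") the shared neighbours are "b" and "c", each of degree 3, so the Adamic-Adar score is 2 / ln 3.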

The dataset was split into training and testing sets, and the SVM and NN models were trained on the training set. During testing, the models were assessed with the Area Under the Receiver Operating Characteristic Curve (AUC-ROC), which measures a model's ability to distinguish between positive examples (linked node pairs) and negative examples (unlinked node pairs).
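AUC-ROC has a simple interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition; the labels and scores are invented for the example, and the pairwise loop is only practical for small test sets.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Invented labels (1 = link exists) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]

auc = auc_roc(labels, scores)
print(round(auc, 3))
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.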


The results show that both the neural network (NN) and the support vector machine (SVM) models predict links effectively in a distributed environment. The NN model's accuracy was 89%, compared with 87% for the SVM model. The SVM model's precision was 0.90, whereas the NN model's was 0.92. With recall scores of 0.85 for the SVM and 0.88 for the NN, both models also performed well in terms of recall. The analysis also showed that both models perform better as more nodes are added to the graph, which suggests that they can scale to large datasets in a distributed setting.
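For readers unfamiliar with these scores, the sketch below shows how accuracy, precision and recall are derived from a set of predictions. The labels and predictions are invented for illustration and do not reproduce the figures reported above.

```python
def classification_metrics(labels, predictions):
    """Accuracy, precision and recall for binary labels (1 = link, 0 = no link)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # links correctly predicted
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # non-links correctly predicted
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # predicted links that do not exist
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # real links that were missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented test labels and model predictions.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]

metrics = classification_metrics(labels, predictions)
print(metrics)
```

Precision answers "of the links the model predicted, how many are real?", while recall answers "of the real links, how many did the model find?".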

Overall, the results show that SVM and NN models are both useful methods for predicting links in distributed settings. The NN model performs better than the SVM model in terms of accuracy and precision, but ultimately the choice between the two will depend on the specifics of the task at hand.