Because Online Social Networks (OSNs) have become increasingly important in the last decade, they have motivated a great deal of research on Social Network Analysis (SNA). Currently, SNA algorithms are evaluated on real datasets obtained from large-scale OSNs, which are usually sampled by Breadth-First Search (BFS), Random Walk (RW), or some variant of RW. However, none of the released datasets provides a statistical guarantee on the difference between the sampled dataset and the ground truth. Moreover, all existing sampling algorithms focus on sampling a single OSN, yet each OSN is itself only a sample of the complete social network. Hence, even if a single OSN is collected in its entirety, the results may still be skewed and may not fully reflect the properties of the complete social network. To address these issues, we make the first attempt to explore the joint sampling of multiple OSNs and propose an approach called Quality-guaranteed Multi-network Sampler (QMSampler). QMSampler provides a statistical guarantee on the difference between the sampled real dataset and the ground truth (the perfect integration of all OSNs). Our experimental results demonstrate that the proposed approach generates a much smaller bias than any existing method. QMSampler has also been released as a free download.
An OSN can be considered a sample of the complete social network because the friends of each person in the OSN are only a subset of that person’s friends in the real world. Therefore, even if an SNA algorithm samples all the nodes in a single OSN, the results may still apply to only that network and may not fully reflect the properties of the complete social network.
For example, friends on Flickr usually share similar hobbies and may poorly represent a user’s working contacts on LinkedIn. Consequently, the graph properties of different OSNs usually differ. Moreover, the social influence model, which is widely used in viral marketing, advertisement targeting, and information diffusion, may be imprecise if the set of edges incident to each node is incomplete, because missing edges tend to cause the node’s activation probability from the influence of friends to be underestimated.
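The underestimation can be illustrated with a small numerical sketch. It assumes an independent-cascade style rule, in which each active friend independently activates the node with some probability; this rule and the probabilities below are assumptions chosen for illustration, not the specific influence model or data used in this paper.

```python
def activation_probability(edge_probs):
    """Probability that a node is activated by at least one active friend.

    Assumes (hypothetically) an independent-cascade rule: each active friend
    succeeds independently with the probability attached to its edge.
    """
    stay_inactive = 1.0
    for p in edge_probs:
        stay_inactive *= 1.0 - p  # this friend fails to activate the node
    return 1.0 - stay_inactive


# Hypothetical influence probabilities on the edges incident to one node.
complete_edges = [0.1, 0.2, 0.3]   # all friends in the complete network
observed_edges = [0.1, 0.2]        # friends visible in a single OSN

p_complete = activation_probability(complete_edges)  # about 0.496
p_observed = activation_probability(observed_edges)  # about 0.28
print(p_complete, p_observed)
```

Under this assumed rule, dropping any edge with positive probability can only lower the computed activation probability, which is the direction of bias described above.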
We propose a new framework for jointly sampling multiple OSNs with a quality guarantee, in order to generate a more complete network in which each node represents a user and each edge represents an online acquaintance between two users. Our first goal is to provide statistical guarantees on the difference between the sampled (and then merged or integrated) real datasets and the ground truth. The ground truth in this paper is defined as the perfect integration of all the OSNs considered: all the nodes and edges in the OSNs are included, and the nodes corresponding to the same person in different OSNs are correctly merged. The difference is the gap between the graph characteristic metrics of the generated graph and those of the ground truth.
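The notions of ground truth and difference can be made concrete with a minimal sketch. It assumes that the mapping from accounts to persons is already known, uses average degree as a stand-in for the graph characteristic metrics, and builds the graph from edge lists only (isolated accounts are ignored). The function names, account names, and toy edges are hypothetical and are not part of QMSampler.

```python
def merge_networks(networks, identity):
    """Integrate several OSNs into one graph over persons.

    networks: dict mapping OSN name -> list of (account_a, account_b) edges
    identity: dict mapping (OSN name, account) -> person id (assumed known)
    Returns (nodes, edges), where each edge is a frozenset of two person ids.
    """
    nodes, edges = set(), set()
    for osn, edge_list in networks.items():
        for a, b in edge_list:
            u, v = identity[(osn, a)], identity[(osn, b)]
            nodes.update((u, v))
            if u != v:  # skip self-loops created by merging accounts
                edges.add(frozenset((u, v)))
    return nodes, edges


def average_degree(nodes, edges):
    """Average degree of an undirected graph."""
    return 2 * len(edges) / len(nodes) if nodes else 0.0


# Toy data: two OSNs whose accounts partly belong to the same persons.
networks = {
    "flickr": [("f_alice", "f_bob"), ("f_bob", "f_carol")],
    "linkedin": [("l_alice", "l_carol"), ("l_alice", "l_dave")],
}
identity = {
    ("flickr", "f_alice"): "alice", ("flickr", "f_bob"): "bob",
    ("flickr", "f_carol"): "carol", ("linkedin", "l_alice"): "alice",
    ("linkedin", "l_carol"): "carol", ("linkedin", "l_dave"): "dave",
}

truth_nodes, truth_edges = merge_networks(networks, identity)
single_nodes, single_edges = merge_networks(
    {"flickr": networks["flickr"]}, identity
)

# Gap between one metric on a single OSN and on the integrated graph.
gap = abs(average_degree(truth_nodes, truth_edges)
          - average_degree(single_nodes, single_edges))
print(gap)
```

In this toy case the integrated graph has four persons and four edges, while Flickr alone shows three persons and two edges, so the single-OSN view understates the average degree. In practice the comparison would be made between a sampled, merged graph and the ground truth rather than between a full OSN and the ground truth.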