Francesco BonchiSenior Research Scientist
Yahoo! Research Barcelona
|HOME PUBLICATIONS Y!BCN SEMINARS PROFESSIONAL RESUME CONTACT|
CASCADE-BASED COMMUNITY DETECTION
Understanding how the adoption of new practices, ideas, beliefs, technologies and products can spread trough a population driven by social influuence, is a central issue for the whole of social sciences. Taking into account the modular structure of the underlying social network provides further important insight in the phenomena known as social contagion or information cascades. In particular, individuals tend to adopt the behavior of their social peers, so that cascades happen first locally, within close-knit communities, and become global “viral” phenomena only when they are able cross the boundaries of these densely connected clusters of people. Therefore, the study of social contagion is intrinsically connected to the problem of understanding the modular structure of networks (known as community detection).
Given a directed social graph and a set of past information cascades observed over the graph, in this new WSDM'13 paper (joint work with Nicola Barbieri and Giuseppe Manco) we study the novel problem of detecting modules of the graph (communities of nodes), that also explain the cascades. Our key observation is that both information propagation and social ties formation in a social network can be explained according to the same latent factor, which ultimately guide a user behavior within the network. Based on this observation, we propose the Community-Cascade Network (CCN) model, a stochastic mixture membership generative model that can fit, at the same time, the social graph and the observed set of cascades. Our model produces overlapping communities and for each node, its level of authority and passive interest in each community it belongs.
CHROMATIC CORRELATION CLUSTERING
In this new KDD'12 paper, we (me, Aris, the other Francesco and Antti) introduce a novel clustering problem in which the pairwise relations between objects are categorical, rather than numerical, as it is usually the case in all clustering frameworks.
Our problem can be naturally viewed as partitioning the vertices of a graph whose edges are of different types, or as we like to think about them, different colors. The key objective in our framework is to find a partition of the vertices of the graph such that the edges in each cluster have, as much as possible, the same color (an example is in the picture above). The framework has plenty of applications. As an example, biologists study protein-protein interaction networks, where vertices represent proteins and edges represent interactions occurring when two or more proteins bind together to carry out their biological function. Those interactions can be of different types, e.g., physical association, direct interaction, co-localization, etc. In these networks, for instance, a cluster containing mainly edges labeled as co- localization, might represent a protein complex, i.e., a group of proteins that interact with each other at the same time and place, forming a single multi-molecular machine. As a further example, social networks are commonly represented as graphs, where the vertices represent individuals and the edges capture relationships among these individuals. Again, these relationships can be of various types, e.g., colleagues, neighbors, schoolmates, football-mates. Check out our techniques in the paper.
QUERY RECOMMENDATIONS IN THE LONG TAIL
In a new paper (accepted at SIGIR'12, together with R. Perego, F. Silvestri, R. Venturini, H. Vahabi) we introduce a new method for query recommendation based on the well-known concept of center-piece subgraph, that allows for the time/space efficient generation of suggestions also for rare, i.e., long-tail queries.
We shift the focus from the queries to the terms they contain, thus devising a novel graph-based method that at once solves the two issues: it produces high-quality recommendations also for long-tail queries, achieving a coverage of approximately 99% of a real-world search engine workload, while enjoying high scalability. Our method is based on a graph having term nodes, query nodes, and two kinds of connections: term-query and query-query. The first connect a term to the queries in which it is contained, the second connect two query nodes if the likelihood that a user submits the second query after having issued the first one is sufficiently high. Such graph induces a Markov chain on which a generic random walker starts from the subset of term nodes contained in the incoming query, moves along query nodes, and restarts (with a given probability) only from the same initial subset of term nodes. Computing a recommendation this way corresponds to finding the center-piece subgraph induced by terms contained into the query.
The term-centric perspective we take, not only solves the long-tail coverage issue, but it also allows the stationary distributions of such a Markov chain to be precomputed and stored in a compressed and cached index, thus achieving efficiency and scalability. In particular, we introduce a method based on an inverted index compressed by using a lossy compression method able to reduce by an average of 80% the space occupancy of the uncompressed data structures. Furthermore, caching is exploited at the term-level to enable scalable generation of query suggestions. Being term-based, caching is able to attain hit-ratios of approximately 95% with a reasonable footprint (i.e., few gigabytes of main memory).
INFLUENCE MAXIMIZATION: a data-based approach
While most of the literature on Influence Maximization has focused mainly on the social graph structure, in our new paper (accepted for VLDB'12 [bibtex], together with Amit Goyal and Laks V. S. Lakshmanan) we proposed a novel data-based approach, that directly leverages available traces of past propagations.
Our Credit Distribution model estimates influence spread by exploiting historical data, thus avoiding the need for learning influence probabilities, and more importantly, avoiding costly Monte Carlo simulations, the standard way to estimate influence spread. Based on this, we developed an efficient, effective, and scalable algorithm for the Influence Maximization problem.
Beyond this main contribution, the paper provides several interesting insights in the Influence Maximization problem. Check it out!
MY KEYNOTE TALK AT WI-IAT 2011Please find below the slides I will use in my keynote talk this morning at WI-IAT 2011 in Lyon. The talk is entitled Influence Propagation in Social Networks: a Data Mining Perspective (pdf). If you want the powerpoint, please e-mail me .
SPARSIFICATION OF INFLUENCE NETWORKS [June 2011]
SPINE (SParsification of Influence NEtworks) is a new algorithm for discoverying the "backbone" of an influence network. Given a social graph and a log of past propagations, we build an instance of the independent-cascade model that describes the propagations. We aim at reducing the complexity of that model, while preserving most of its accuracy in describing the data. The paper will be presented next August at the KDD 2011 conference.
#viscous_democracy #castelleres @CACM [June 2011]
Our paper Viscous democracy for social networks, (together with Paolo Boldi, Carlos Castillo and Sebastiano Vigna) has been published in the virtual extension of the June 2011 issue of Communications of ACM. In this article, we propose a middle-ground between direct democracy (citizens vote on every issue) and representative democracy (citizens elect representatives that decide on their behalf on every issue). Our proposal, a type of delegative democracy, allows them to express their opinion directly or to delegate their power on a proxy. Proxy delegation can be transitive: a proxy can delegate in another proxy. However, as our vote travels farther away through a delegation chain, we would like to introduce some reluctance in the way the power is transferred to other people we may not know directly. In that sense, we include a dampening factor (like PageRank does) to reduce the amount of power delegated through long chains. Technically, our system of viscous democracy is a system of transitive proxy voting with exponential damping.
LINK PREDICTION: A RULE-BASED APPROACH [Apr 2011]Recently I've seen a nice survey on Link Prediction covering a lot of different approaches, but overlooking our frequent pattern based approach introduced in:
Bringmann et al. Learning and Predicting the Evolution of Social Networks IEEE Intelligent Systems ©IEEE 25(4):26-35 - Jul/Aug 2010.
Briefly, in this paper we extend our previous work on Mining Graph Evolution Rules, showing how to use those rules for Link Prediction. Our method is quite different from standard Link Prediction, and leverages the history of the network evolution till now to predict future evolution.
Our frequent pattern based approach not only predicts arrival of new edges among pairs of existing nodes, but also allows to predict edges among one existing node and a new node that is not yet part of the graph.
NEW BOOK ON PRIVACY AND MINING [Jan 2011]
Finally published the book that I co-edited with Elena Ferrari in the Data Mining and Knowledge Discovery Series by Chapman & Hall/CRC. The book contains 18 chapters presenting the state-of-the-art of privacy-preserving data mining techniques for application domains, such as biology, medicine, mobility, web and social networks. The chapters are authored by renowned authorities of the privacy preserving data mining field, and not only cover well-established results -- they also explore complex domains where privacy issues are generally clear and well defined, but the solutions are still preliminary and in continuous development. Therefore the book is also a valuable collection of open research problems and roadmaps for further research that can not miss in your library ;-)
Check out Elena's interview with the IEEE Computer Society.
TAXOMO SOFTWARE RELEASED [Jan 2011]I am glad to announce that today we released the TAXOMO sequence mining software under a BSD license. TAXOMO is a data-mining tool for sequences. It takes as input a set of sequences and a taxonomy, and generates a succinct description of the sequences (specifically, a Markov chain with lumped states). The input sequences may represent any kind of data, e.g.: trajectories on a map, web pages visited by a user, etc. The taxonomy should be defined over the states in the sequences. In the case of a map, for instance, they can be regions and sub-regions for the points in the map. In the case of a web site, they can be categories and sub-categories for the pages. Taxomo was developed at Yahoo! Research Barcelona, and it is described in:
Bonchi et al. Taxonomy-driven lumping for sequence mining Data Mining and Knowledge Discovery Journal ©Springer 19(2):227-244 - Oct 2009.
For more information and download, see: http://taxomo.sourceforge.net/
MEME RANK [Dec 2010]I presented in Sydney the Meme Rank paper about a system to maximize virality of memes in a microblogging platform. Check this cool animation representing the propagation of one meme. Nodes indicate users, and the more followers a user has, the darker the node associated to that user. Edges indicate propagations, with thick and short edges indicating same-day re-posts; the longer the edge, the longer the time between re-posts.
ICDM 2010 PANEL [Dec 2010]I was a panelist at ICDM'10. The panel was entitled "Top-10 Data Mining Case Studies". The other panelists were: Ashok Srivasta, Bart Goethals, Rayid Ghani, Christos Faloutsos, Longbing Cao, Osmar Zaiane, Rong Duan, Paul Beinat, and Geoff McLachlan. The panel chairs were Gabor Melli and Xindong Wu.
YAHOO! CLUES LAUNCHED [Nov 2010]
Yahoo! launched Clues, a service based on science from Yahoo! Research Barcelona. Yahoo! Clues lets you explore how people are using Yahoo! Search. When you enter a word or phrase in the "Search Term" field and click Discover, you’ll see information about that search term’s popularity over time, across demographic groups, and in different locations.You can also enter a second search term in the "Compare With" field. This will show you information on both search terms, side by side.
Among other things, it allow users to browse a real query-flow graph:
ME ON TWITTER
|WEB DESIGN CREDITS|