ICDM 2016 - BARCELONAI'm glad to announce that I'll be organizer and PC-Chair of , which will be held in Barcelona, Spain, December 12-15, 2016. Michael Berthold will be the other PC-Chair, while Ricardo Baeza-Yates will be the General Chair. More information soon.
I'm beyond excited to announce that starting from mid September I'll join the ISI Foundation where I'll create and lead the "Algorithmic Data Analytics" team. At the same time I'll maintain a presence in Barcelona where I'll keep collaborating with the Data Mining team (involving some ex-Yahoos) at the newly founded Technological Center of Catalunya, named Eurecat.
The ISI Foundation is a 30 year old research institute that was founded on the principle of challenging Italy to do the best science possible in an environment without constraints. The mission of ISI is defined by a commitment to world-class interdisciplinary research, to knowledge sharing, and to real-world impact on significant challenges of modern society. It is the hub of a vibrant network of scientists from all over the world pursuing basic research in the physical, mathematical, and computer sciences, as well as fundamental and applied challenges in computational thinking and data-driven science. This is one-of-a-kind environment in the European research landscape. It is a fantastic opportunity for me to start a new research team and to conduct high-impact interdisciplinary research.
Today the ISI Foundation is an international organization based in Torino, Italy and with a new office in New York City: I'm looking forward to visit the Big Apple in the coming months and to engage with the local research community.
LINK PREDICTION WITH EXPLANATIONS
User recommender systems are a key component in any online social networking platform: they help the users growing their network faster, thus driving engagement and loyalty. In this new KDD paper we study link prediction with explanations for user recommendation in social networks. For this problem we propose WTFW (“Who to Follow and Why”), a stochastic topic model for link prediction over directed and nodes-attributed graphs. Our model not only predicts links, but for each predicted link it decides whether it is a “topical” or a “social” link, and depending on this decision it produces a different type of explanation.
Enriching recommendations with explanations has the benefit to increase the trust of the user in the recommendation, and thus the likelihood that the recommendation is adopted. While these benefits are well understood in classic collaborative-filtering recommender systems, providing explanations in the context of user recommendation systems is still largely underdeveloped: in fact, in most of the real-world systems, the unique explanations given for user recommendations are of the type “you should follow user Z because your contacts X and Y do the same”. Instead, our starting observation (and driving motivation) is that a link creation is usually explainable by one of two main reasons: interest identity (topical) or personal relations (social). This observation is rooted in sociology, where it goes under the name common identity and common bond theory.
THE PURSUIT OF A GOOD POSSIBLE WORLD
An uncertain graph, i.e., a graph whose edges are associated with a probability of existence, can be seen as a set of 2^|E| possible worlds. Each possible world is a deterministic instantiation of the uncertain graph. In this new SIGMOD paper we study the problem of selecting just one possible world given an uncertain graph. In particular, we want to select a "good" possible world: one that preserves the underlying graph properties. Starting from the observation that the vertex degree is one of the most fundamental properties of the structure of a graph, we conjecture that by preserving the degree of each vertex we capture the essence of the underlying uncertain graph, and thus accurately approximate other properties. Therefore we study the problem of selecting a deterministic instance of the uncertain graph, which minimizes the sume over all vertices of the difference of their expected degree in the uncertain graph, and the degree in the deterministic instance.
COMMUNITY DETECTION WITHOUT THE NETWORK
While the literature on community detection methods is wide, the applications are mainly limited to data analysis for social sciences. This is due to a fundamental observation: the players which would benefit more from knowing the community structure of a social network, e.g., companies advertising or developing web applications, often do not have access to the whole network! It is a matter of fact that the social network platforms are owned by third party such as Facebook or Twitter, which have realized that their proprietary social graph is an asset of inestimable value. Thus they keep it secret, for sake of commercial competitive advantage, as well as due to privacy legislation.
In this new ICDM'13 paper we tackle the challenging problem of learning communities when not even the social graph is available. Instead what we have is a database of past propagations. We propose a stochastic framework which assumes that the cascades are governed by un underlying diffusion process over the unobserved social network, and that such diffusion model is based on community-level influence. By fitting the model parameters to the user activity log, we learn the community membership and the level of influence of each user in each community. This allows to identify for each community the key users,, i.e., the leaders, which are most likely to influence the rest of the community to adopt a certain item.
Unfortunately the paper has been accepted as a short paper, so only 6 pages. We have much more material to show, especially esperiments on real data. We will soon distribute a longer version of our work. Stay tuned.
DENSER THAN THE DENSEST SUBGRAPH
Motivated by the fact that the direct optimization of edge density is not meaningful, as even a single edge achieves maximum density, research has focused on optimizing alternative density functions. A very popular among such functions is the average degree, whose maximization leads to the well-known densest-subgraph notion. Despite its name, the densest subgraph often results to be a large graph, with small edge density and large diameter.
In this new KDD'13 paper, we define a novel density function, which gives subgraphs of much higher quality than densest subgraphs: the graphs found by our method are compact, dense, and with smaller diameter. The proposed notion of density that we dub optimal quasi-clique, is derived by turning the quasi-clique condition into an objective function: the goal is to maximize the number of edges that are present in the subgraph compared to the number of edges expected under the Erdös-Rényi random-graph model.
THE BANG FOR THE BUCK
In this new KDD'13 paper we study competitive viral marketing from the host perspective. Competitive viral marketing refers to the case in which two or more players compete with comparable products over the same market. By host we mean the social networking platform owner, which offers viral marketing as a service to its clients. From the host’s perspective, it is important not only to choose the seed nodes to target in the campaigns in such a way to maximize the collective expected number of adoptions across all companies, but also to allocate seeds to companies in a way that guarantees the ''bang for the buck'' for all companies is nearly the same. Intuitively, the bang for the buck for a company is the cost benefit ratio between the expected number of adopters of its product over its number of seeds. For more details, please read the paper.
STRIP: STREAM LEARNING OF SOCIAL INFLUENCE
In this new (sexy) KDD'13 paper we introduce STRIP, a suite of streaming methods for computing the strength of social influence along each link of a social network. STRIP allows to compute social influence from the continous stream of actions performed by the users of a social networking platform. Thanks to a wise use of probabilistic approximations, min-wise independent hashing functions, and streaming sliding windows, STRIP can learn influence strength with only one pass over the data and using little memory, thus allowing to scale to big data.
CASCADE-BASED COMMUNITY DETECTION
Understanding how the adoption of new practices, ideas, beliefs, technologies and products can spread trough a population driven by social influuence, is a central issue for the whole of social sciences. Taking into account the modular structure of the underlying social network provides further important insight in the phenomena known as social contagion or information cascades. In particular, individuals tend to adopt the behavior of their social peers, so that cascades happen first locally, within close-knit communities, and become global “viral” phenomena only when they are able cross the boundaries of these densely connected clusters of people. Therefore, the study of social contagion is intrinsically connected to the problem of understanding the modular structure of networks (known as community detection).
Given a directed social graph and a set of past information cascades observed over the graph, in this new WSDM'13 paper (joint work with Nicola Barbieri and Giuseppe Manco) we study the novel problem of detecting modules of the graph (communities of nodes), that also explain the cascades. Our key observation is that both information propagation and social ties formation in a social network can be explained according to the same latent factor, which ultimately guide a user behavior within the network. Based on this observation, we propose the Community-Cascade Network (CCN) model, a stochastic mixture membership generative model that can fit, at the same time, the social graph and the observed set of cascades. Our model produces overlapping communities and for each node, its level of authority and passive interest in each community it belongs.
CHROMATIC CORRELATION CLUSTERING
In this new KDD'12 paper, we (me, Aris, the other Francesco and Antti) introduce a novel clustering problem in which the pairwise relations between objects are categorical, rather than numerical, as it is usually the case in all clustering frameworks.
Our problem can be naturally viewed as partitioning the vertices of a graph whose edges are of different types, or as we like to think about them, different colors. The key objective in our framework is to find a partition of the vertices of the graph such that the edges in each cluster have, as much as possible, the same color (an example is in the picture above). The framework has plenty of applications. As an example, biologists study protein-protein interaction networks, where vertices represent proteins and edges represent interactions occurring when two or more proteins bind together to carry out their biological function. Those interactions can be of different types, e.g., physical association, direct interaction, co-localization, etc. In these networks, for instance, a cluster containing mainly edges labeled as co- localization, might represent a protein complex, i.e., a group of proteins that interact with each other at the same time and place, forming a single multi-molecular machine. As a further example, social networks are commonly represented as graphs, where the vertices represent individuals and the edges capture relationships among these individuals. Again, these relationships can be of various types, e.g., colleagues, neighbors, schoolmates, football-mates. Check out our techniques in the paper.
QUERY RECOMMENDATIONS IN THE LONG TAIL
In a new paper (accepted at SIGIR'12, together with R. Perego, F. Silvestri, R. Venturini, H. Vahabi) we introduce a new method for query recommendation based on the well-known concept of center-piece subgraph, that allows for the time/space efficient generation of suggestions also for rare, i.e., long-tail queries.
We shift the focus from the queries to the terms they contain, thus devising a novel graph-based method that at once solves the two issues: it produces high-quality recommendations also for long-tail queries, achieving a coverage of approximately 99% of a real-world search engine workload, while enjoying high scalability. Our method is based on a graph having term nodes, query nodes, and two kinds of connections: term-query and query-query. The first connect a term to the queries in which it is contained, the second connect two query nodes if the likelihood that a user submits the second query after having issued the first one is sufficiently high. Such graph induces a Markov chain on which a generic random walker starts from the subset of term nodes contained in the incoming query, moves along query nodes, and restarts (with a given probability) only from the same initial subset of term nodes. Computing a recommendation this way corresponds to finding the center-piece subgraph induced by terms contained into the query.
The term-centric perspective we take, not only solves the long-tail coverage issue, but it also allows the stationary distributions of such a Markov chain to be precomputed and stored in a compressed and cached index, thus achieving efficiency and scalability. In particular, we introduce a method based on an inverted index compressed by using a lossy compression method able to reduce by an average of 80% the space occupancy of the uncompressed data structures. Furthermore, caching is exploited at the term-level to enable scalable generation of query suggestions. Being term-based, caching is able to attain hit-ratios of approximately 95% with a reasonable footprint (i.e., few gigabytes of main memory).
INFLUENCE MAXIMIZATION: a data-based approach
While most of the literature on Influence Maximization has focused mainly on the social graph structure, in our new paper (accepted for VLDB'12 [bibtex], together with Amit Goyal and Laks V. S. Lakshmanan) we proposed a novel data-based approach, that directly leverages available traces of past propagations.
Our Credit Distribution model estimates influence spread by exploiting historical data, thus avoiding the need for learning influence probabilities, and more importantly, avoiding costly Monte Carlo simulations, the standard way to estimate influence spread. Based on this, we developed an efficient, effective, and scalable algorithm for the Influence Maximization problem.
Beyond this main contribution, the paper provides several interesting insights in the Influence Maximization problem. Check it out!
MY KEYNOTE TALK AT WI-IAT 2011Please find below the slides I will use in my keynote talk this morning at WI-IAT 2011 in Lyon. The talk is entitled Influence Propagation in Social Networks: a Data Mining Perspective (pdf). If you want the powerpoint, please e-mail me .
SPARSIFICATION OF INFLUENCE NETWORKS [June 2011]
SPINE (SParsification of Influence NEtworks) is a new algorithm for discoverying the "backbone" of an influence network. Given a social graph and a log of past propagations, we build an instance of the independent-cascade model that describes the propagations. We aim at reducing the complexity of that model, while preserving most of its accuracy in describing the data. The paper will be presented next August at the KDD 2011 conference.
#viscous_democracy #castelleres @CACM [June 2011]
Our paper Viscous democracy for social networks, (together with Paolo Boldi, Carlos Castillo and Sebastiano Vigna) has been published in the virtual extension of the June 2011 issue of Communications of ACM. In this article, we propose a middle-ground between direct democracy (citizens vote on every issue) and representative democracy (citizens elect representatives that decide on their behalf on every issue). Our proposal, a type of delegative democracy, allows them to express their opinion directly or to delegate their power on a proxy. Proxy delegation can be transitive: a proxy can delegate in another proxy. However, as our vote travels farther away through a delegation chain, we would like to introduce some reluctance in the way the power is transferred to other people we may not know directly. In that sense, we include a dampening factor (like PageRank does) to reduce the amount of power delegated through long chains. Technically, our system of viscous democracy is a system of transitive proxy voting with exponential damping.
LINK PREDICTION: A RULE-BASED APPROACH [Apr 2011]Recently I've seen a nice survey on Link Prediction covering a lot of different approaches, but overlooking our frequent pattern based approach introduced in:
Bringmann et al. Learning and Predicting the Evolution of Social Networks IEEE Intelligent Systems ©IEEE 25(4):26-35 - Jul/Aug 2010.
Briefly, in this paper we extend our previous work on Mining Graph Evolution Rules, showing how to use those rules for Link Prediction. Our method is quite different from standard Link Prediction, and leverages the history of the network evolution till now to predict future evolution.
Our frequent pattern based approach not only predicts arrival of new edges among pairs of existing nodes, but also allows to predict edges among one existing node and a new node that is not yet part of the graph.
NEW BOOK ON PRIVACY AND MINING [Jan 2011]
Finally published the book that I co-edited with Elena Ferrari in the Data Mining and Knowledge Discovery Series by Chapman & Hall/CRC. The book contains 18 chapters presenting the state-of-the-art of privacy-preserving data mining techniques for application domains, such as biology, medicine, mobility, web and social networks. The chapters are authored by renowned authorities of the privacy preserving data mining field, and not only cover well-established results -- they also explore complex domains where privacy issues are generally clear and well defined, but the solutions are still preliminary and in continuous development. Therefore the book is also a valuable collection of open research problems and roadmaps for further research that can not miss in your library ;-)
Check out Elena's interview with the IEEE Computer Society.
TAXOMO SOFTWARE RELEASED [Jan 2011]I am glad to announce that today we released the TAXOMO sequence mining software under a BSD license. TAXOMO is a data-mining tool for sequences. It takes as input a set of sequences and a taxonomy, and generates a succinct description of the sequences (specifically, a Markov chain with lumped states). The input sequences may represent any kind of data, e.g.: trajectories on a map, web pages visited by a user, etc. The taxonomy should be defined over the states in the sequences. In the case of a map, for instance, they can be regions and sub-regions for the points in the map. In the case of a web site, they can be categories and sub-categories for the pages. Taxomo was developed at Yahoo! Research Barcelona, and it is described in:
Bonchi et al. Taxonomy-driven lumping for sequence mining Data Mining and Knowledge Discovery Journal ©Springer 19(2):227-244 - Oct 2009.
For more information and download, see: http://taxomo.sourceforge.net/
MEME RANK [Dec 2010]I presented in Sydney the Meme Rank paper about a system to maximize virality of memes in a microblogging platform. Check this cool animation representing the propagation of one meme. Nodes indicate users, and the more followers a user has, the darker the node associated to that user. Edges indicate propagations, with thick and short edges indicating same-day re-posts; the longer the edge, the longer the time between re-posts.
ICDM 2010 PANEL [Dec 2010]I was a panelist at ICDM'10. The panel was entitled "Top-10 Data Mining Case Studies". The other panelists were: Ashok Srivasta, Bart Goethals, Rayid Ghani, Christos Faloutsos, Longbing Cao, Osmar Zaiane, Rong Duan, Paul Beinat, and Geoff McLachlan. The panel chairs were Gabor Melli and Xindong Wu.
YAHOO! CLUES LAUNCHED [Nov 2010]
Yahoo! launched Clues, a service based on science from Yahoo! Research Barcelona. Yahoo! Clues lets you explore how people are using Yahoo! Search. When you enter a word or phrase in the "Search Term" field and click Discover, you’ll see information about that search term’s popularity over time, across demographic groups, and in different locations.You can also enter a second search term in the "Compare With" field. This will show you information on both search terms, side by side.
Among other things, it allow users to browse a real query-flow graph:
ME ON TWITTERTweets by @FrancescoBonchi