ParisTech 1st Data Science Game – May-June 2015


ENSAE ParisTech and ParisTech, along with Ensta ParisTech and Telecom ParisTech, invite data science students from universities all around the world to participate in the 1st edition of the Data Science Game.

By solving a data-driven problem, students will be able to showcase their data science expertise in a spirit that is both competitive and friendly.

The competition is supported by two major partners: Google Inc., which will provide the scope and material of the competition, and Capgemini, which will provide an outstanding setting for the competition in Paris.

Because data are both a major input and output of our connected lives, because data science students are the builders of tomorrow, and because we believe they deserve to be in the limelight, we encourage you to come and join this first international data science event in Paris. Build a team, work with data provided by our partner, tackle very challenging questions and demonstrate your skills among data science students from all around the world.

A two-phase competition:

  • An online, non-eliminatory phase from mid-May to mid-June 2015
  • A two-day competition in Paris on June 20th and 21st, 2015

More information, schedule and registration on

Real-Time Big Data Stream Analytics – Seminar by Albert Bifet on April 30th

Albert Bifet has been invited by the Big Data & Market Insights Chair to give a talk on Thursday, April 30th at the National University of Singapore (NUS) School of Computing, Computer Science department.


Big Data is a term used to describe datasets that we cannot manage with current methodologies or data mining software tools due to their large size and complexity. Big Data mining is the capability of extracting useful information from these large datasets or streams of data. New mining techniques are necessary due to the volume, variability, and velocity of such data. In this talk, we will focus on advanced techniques for mining Big Data in real time using evolving data stream techniques: using a small amount of time and memory resources, and being able to adapt to changes. We will discuss some advanced state-of-the-art methodologies in stream mining based on the use of adaptive-size sliding windows. Finally, we will present the MOA software framework, with classification, regression, and frequent pattern methods, and the new Apache SAMOA distributed streaming software.

Dr. Albert Bifet is a Senior Researcher at Huawei. He is the author of the book Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams. His main research interest is learning from data streams. He has published more than 60 articles. He is serving as Industrial Track co-chair of ECML-PKDD 2015. He is one of the leaders of the MOA and Apache SAMOA software environments for implementing algorithms and running experiments for online learning from evolving data streams. He has been co-chair of BigMine (2015, 2014, 2013, 2012) and of the ACM SAC Data Streams Track (2015, 2014, 2013, 2012).

Telecom ParisTech will host the 2015 ASONAM Conference


The 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) will be held in Paris, from August 25th to 28th. About 300 social network analysis experts are expected at Telecom ParisTech, in the 13th arrondissement.

The study of social networks originated in the social and business communities. In recent years, social network research has advanced significantly; the development of sophisticated techniques for Social Network Analysis and Mining (SNAM) has been strongly driven by online social websites, email logs, phone logs and instant messaging systems, which are widely analysed using graph theory and machine learning techniques.

People increasingly perceive the Web as a social medium that fosters interaction among people, sharing of experiences and knowledge, group activities, community formation and evolution. This has led to a rising prominence of SNAM in academia, politics, homeland security and business.

ASONAM 2015 will primarily provide an interdisciplinary venue bringing together practitioners and researchers from a variety of SNAM fields to promote collaboration and the exchange of ideas and practices. The conference solicits both experimental and theoretical work on social network analysis and mining, along with applications to real-life situations.

More details on

Groupe BPCE joins the Big Data & Market Insights Chair

Groupe BPCE is joining Groupe Yves Rocher and Deloitte as a partner of Télécom ParisTech’s Big Data & Market Insights Research Chair, launched with Télécom Business School in December 2013. The Chair’s interdisciplinary work is geared towards improving companies’ knowledge of their clients, helping them personalise products and services, and developing techniques for preventing IT fraud and intrusions.

“We are delighted to welcome Groupe BPCE as a partner of the Big Data & Market Insights Chair. Firstly, like our existing partners, Groupe BPCE recognises the major challenges raised by big data and the interest of joining forces with a specialist research team in order to maximise the understanding and use of this data both for the benefit of the Group and of its clients. And secondly, the fact that our partners come from different sectors of activity enables us to enhance our knowledge of the various business issues and needs linked to big data and to develop effective solutions tailored to these individual issues and needs. Groupe BPCE’s entry into the Chair means we can incorporate the needs of the banking and insurance industry into our research work” underlines Talel Abdessalem, the Chair holder.

Download the Press Release (PDF)

Seminar: Platforms and Applications for “Big and Fast” Data Analytics

On Wednesday, January 14th at 2 pm, the Big Data & Market Insights Chair will welcome Yanlei Diao, Associate Professor at the Department of Computer Science of the University of Massachusetts, Amherst.

The seminar will be held at the LINCS, 23 avenue d’Italie, 75013 Paris, in the Salle du Conseil.

Talk overview

Recently there has been significant interest in building big data systems that can handle not only “big data” but also “fast data” for analytics. Our work is strongly motivated by recent real-world case studies that point to the need for a general, unified data processing framework to support analytical queries with different latency requirements. Towards this goal, our project is designed to transform the popular MapReduce computation model, originally proposed for batch processing, into distributed (near) real-time processing.

In this talk, I start by examining the widely used Hadoop system and presenting a thorough analysis to understand the causes of high latency in Hadoop. I then present a number of necessary architectural changes, as well as new resource configuration and optimization techniques to meet user-specified latency requirements while maximizing throughput.

Experiments using typical workloads in click stream analysis and Twitter feed analysis show that our techniques reduce the latency from tens or hundreds of seconds in Hadoop to sub-second in our system, with a 2x–7x increase in throughput. Our system also outperforms state-of-the-art distributed stream systems, Twitter Storm and Spark Streaming, by a wide margin. Finally, I will show some initial results and challenges of supporting big and fast data analytics in the emerging domain of genomics.


Yanlei Diao is Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on big data analytics, data streams, uncertain data management, and RFID and sensor data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.

Yanlei Diao was a recipient of the 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year), IBM Scalable Innovation Faculty Award, and NSF Career Award, and she was a finalist of the Microsoft Research New Faculty Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention.

She is currently Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Area Chair of SIGMOD 2015, and member of the SIGMOD Executive Committee and SIGMOD Software Systems Award Committee. In the past, she has served as Associate Editor of PVLDB, organizing committee member of SIGMOD, CIDR, DMSN, and the New England Database Summit, as well as on the program committees of many international conferences and workshops. Her research has been strongly supported by industry with awards from Google, IBM, Cisco, NEC labs, and the Advanced Cybersecurity Center.


Thesis defense: Intelligent Content Acquisition in Web Archiving

On Wednesday, December 17th at Telecom ParisTech, at 2 pm in the Amphi Grenat, Muhammad Faheem will defend his thesis on Intelligent Content Acquisition in Web Archiving. Here is the abstract:

Web sites are dynamic in nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the Web pages contain. In this thesis, we present systems that crawl the Web in an intelligent manner.

The application-aware helper (AAH) fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications. Because the AAH is aware of the Web application currently being crawled, it is able to refine the list of URLs to process and to extend the archive with semantic information about the extracted content. The AAH introduces a semi-automatic crawling approach that relies on hand-written descriptions of known Web sites.

We also propose a fully-automatic system that does not require any human intervention to crawl the Web pages. We introduce ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler (fully automatic) that utilizes the inner structure of the Web pages and guides the crawling process based on the importance of their content.

A large part of the information on the Web is hidden behind Web forms (this is known as the deep Web, invisible Web, or hidden Web). The systems described above do not crawl such hidden Web pages. To address this problem, we propose OWET (Open Web Extraction Toolkit), a free, publicly available data extraction framework.

Thesis defense: Large scale recommender systems

On Monday, December 15th, at 2 pm, Modou Gueye will defend his thesis ​at Telecom ParisTech, in room B312. Here is the abstract:

In this thesis, we address the scalability problem of recommender systems. We propose accurate and scalable algorithms. We first consider matrix factorization techniques in a dynamic context, where new ratings are continuously produced. In such a case, it is not possible to keep the model up to date, due to the incompressible time needed to compute it; this holds even if a distributed technique is used for matrix factorization, since at the very least the ratings produced during the model computation will be missing. Our solution reduces the loss of recommendation quality over time by introducing stable biases which track users’ behavior deviation. These biases are continuously updated with the new ratings, in order to maintain the quality of recommendations at a high level for a longer time.
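The thesis does not spell out the update rule, but the idea of keeping the factorization frozen between recomputations while the biases absorb new ratings can be sketched as follows. The learning rate, regularization weight and SGD form are illustrative assumptions, not the thesis’s actual formulation; `pred_static` stands for the frozen part of the prediction (global mean plus latent-factor dot product) from the last full factorization.

```python
def update_biases(b_user, b_item, rating, pred_static, lr=0.05, reg=0.02):
    """One SGD step on the user/item biases for a freshly arrived rating.

    Only the biases move between full model recomputations; the
    factorization contribution `pred_static` stays frozen.
    """
    err = rating - (pred_static + b_user + b_item)
    b_user += lr * (err - reg * b_user)
    b_item += lr * (err - reg * b_item)
    return b_user, b_item

# Example: the frozen model predicts 3.0, but the user now rates 4.0
# repeatedly; the biases drift to absorb most of the gap.
b_u, b_i, frozen = 0.0, 0.0, 3.0
for _ in range(200):
    b_u, b_i = update_biases(b_u, b_i, rating=4.0, pred_static=frozen)
```

Because the regularization pulls the biases back towards zero, they stay stable rather than chasing every fluctuation, which matches the “stable biases tracking behavior deviation” described above.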

We also consider the context of online social networks and tag recommendation. We propose an algorithm that takes into account the popularity of the tags and the opinions of the users’ neighborhood. However, unlike common nearest-neighbour approaches, our algorithm does not rely on a fixed number of neighbors when computing a recommendation. It uses a heuristic that bounds the network traversal in a way that enables computing the recommendations on the fly, with a limited computation cost, while preserving the quality of the recommendations.
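A toy sketch of that idea (the weighting scheme, visit budget and popularity prior are invented for illustration and are not the thesis’s actual heuristic): a breadth-first traversal of the social graph, capped at a fixed number of visited users, scores tags by distance-discounted neighbor usage plus a small global-popularity term.

```python
from collections import deque, Counter

def recommend_tags(graph, user_tags, start, max_visits=50, alpha=0.7, k=3):
    """Score tags via a bounded BFS of the social graph plus global popularity.

    `graph` maps user -> neighbors; `user_tags` maps user -> list of tags.
    The traversal stops after `max_visits` users, bounding the cost.
    """
    popularity = Counter(t for tags in user_tags.values() for t in tags)
    scores = Counter()
    seen = {start}
    queue = deque([(start, 0)])
    visits = 0
    while queue and visits < max_visits:
        node, depth = queue.popleft()
        visits += 1
        if node != start:
            weight = alpha ** depth          # closer neighbors count more
            for t in user_tags.get(node, []):
                scores[t] += weight
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, depth + 1))
    for t in scores:
        scores[t] += 0.1 * popularity[t]     # small global-popularity prior
    own = set(user_tags.get(start, []))
    return [t for t, _ in scores.most_common() if t not in own][:k]

# Example: user "u" with neighbors "a", "b" and a second-hop user "c".
graph = {"u": ["a", "b"], "a": ["u", "c"], "b": ["u"], "c": ["a"]}
tags = {"u": ["python"], "a": ["ml", "python"], "b": ["ml"], "c": ["java"]}
suggestions = recommend_tags(graph, tags, "u")
```

The `max_visits` cap is what makes the cost independent of the total network size, mirroring the bounded-traversal property claimed in the abstract.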

Finally, we propose a novel approach that improves the accuracy of the recommendations for top-k algorithms. Instead of using a fixed list size, we adjust the number of items to recommend in a way that optimizes the global accuracy of the recommendations. In other words, we optimize the likelihood that all the recommended items will be chosen by the user, and find the best candidate sublist (i.e., the most accurate one) to recommend to the user.
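One simple instantiation of that adaptive-list-size idea (not the thesis’s actual objective function, just an illustrative stand-in): given acceptance probabilities for the ranked candidates, score each prefix by the likelihood that all its items are chosen, multiplied by its length so that longer lists are rewarded, and keep the best prefix.

```python
def best_sublist(probs, max_k=10):
    """Pick the prefix of the ranked list maximizing prod(p) * len(prefix).

    `probs`: acceptance probabilities of candidate items, ranked descending.
    The product term is the likelihood that *all* recommended items are
    chosen; the length factor rewards recommending more items.
    """
    best_k, best_score, running = 1, 0.0, 1.0
    for k, p in enumerate(probs[:max_k], start=1):
        running *= p
        score = running * k
        if score > best_score:
            best_k, best_score = k, score
    return probs[:best_k], best_score

# Example: confident about two items, doubtful about the rest,
# so the adaptive list stops after two instead of a fixed k.
items, score = best_sublist([0.9, 0.8, 0.6, 0.2, 0.1])
```

With these probabilities the best trade-off is a two-item list: adding the third item multiplies the all-chosen likelihood by 0.6, which the length reward does not offset.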

General Objectives


The Chair is linked to the research activity of two research groups within Institut Mines-Télécom: the IC2/DBWeb team of Telecom ParisTech and the Management, Marketing and Strategy department of Telecom Ecole de Management. The focus is on massive data management and mining, web information extraction, social network analysis, data visualization, online marketing and advertising, and business models.

The Big Data and Market Insights Chair aims to:

  • Tackle key challenging Big Data problems and develop new solutions and tools
  • Promote data-driven decision-making and marketing solutions, and incorporate Big Data into Business Intelligence (BI), predictive analytics tools and marketing strategies
  • Serve as a framework for exchange between the researchers involved in the Chair and its industrial partners: sharing concrete problems, data sets, experiments, innovative solutions, etc.

Its purpose is twofold: to support and promote high-quality research activity, on the one hand, and to raise our students’ awareness of the economic and technological challenges raised by Big Data, on the other.