Data labeling as an integral part of ML production: real-life use cases and technologies

Olga Megorskaya

AI stands on three pillars: algorithms, hardware, and data. While the first two have already become commodities equally available to all players on the market, training data is still both a major bottleneck in ML production and the last bastion defining the unique features of an AI-powered product. In this talk we will look at real-life use cases of building data labeling pipelines for search relevance evaluation, content moderation, marketing surveys, and other tasks, and at the open-source automation tools that let you launch such pipelines from the comfort of your terminal.

For the last 10 years, Olga has been in charge of developing infrastructure and implementing effective use of data labeling via crowdsourcing for all ML-based products of European tech giant Yandex, including Search (web, multimedia, geo), voice assistant, speech technologies for Yandex Cloud, self-driving cars, and many more. Olga is a co-author of research papers and tutorials on efficient crowdsourcing and quality control at SIGIR, CVPR, KDD, WSDM, and SIGMOD, and led the panel discussion at the Crowd Science workshop at NeurIPS'20, ICML'21, and VLDB'21.