Data Assessment and Readiness for AI

1st International Workshop on Data Assessment and Readiness for AI

@ Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)

11-14 May, 2021, New Delhi, India

Important Notice: The safety and well-being of all workshop participants are our priority. Depending on the COVID-19 situation, the workshop will be held with PAKDD either in New Delhi, India, as planned, or as an online event.

In the last several years, AI/ML technologies have become pervasive in academia and industry, finding utility in new and challenging applications. While there has been a focus on building better, smarter and more automated AI pipelines, little work has been done to systematically understand the challenges in determining the readiness of data to be fed into these pipelines. Given a business problem, several questions remain open: How does one select the right data from a data source? Is the collected data of appropriate quality? If not, what cleaning techniques should be applied, and how does one determine whether the goals of data cleaning have been achieved? Researchers and practitioners alike have increasingly come to realize that the real-world utility of an ML model is only as good as the data it has been trained on. Therefore, developing techniques and frameworks that help determine the readiness of data for training and deploying machine learning models is of utmost importance.

Important Dates

Paper Submission: Feb 10, 2021 (Extended Deadline)
Author Notification: Feb 22, 2021
Camera-Ready Submission: Mar 8, 2021
All deadlines are at 23:59 Pacific Standard Time (PST).

Call for Papers

Workshop Scope

The goal of this workshop is to bring together researchers working on data acquisition, data labeling, data quality, data preparation and AutoML to understand how detecting and remediating data issues helps build better models. Methods of data assessment can change depending on the modality of the data, so the workshop invites researchers from academia and industry to submit novel propositions for systematically identifying and mitigating data issues to make data AI-ready across different modalities: structured (or tabular) data, unstructured data (such as text), graph-structured (relational, network) data, time series data, etc. We would also like to explore state-of-the-art deep learning and AI concepts such as deep reinforcement learning, graph neural networks, self-supervised learning, capsule networks and adversarial learning to address the problems of data assessment and readiness.

Topics of Interest

  • Algorithms for explainable data quality detection and remediation for ML
  • Automated data cleaning workflows with explanations
  • Smarter data visualizations for high dimensional data
  • Automatic labeling of datasets from a small amount of labeled data
  • Label noise detection, explanation and incorporating feedback
  • Incorporating domain knowledge for data cleaning and data transformations
  • Data privacy and encryption techniques, and their impact on the ML pipeline
  • Automatic ordering of datasets based on difficulty level, with explanations
  • Outlier (or anomaly) detection and mitigation in data
  • Detection of bias in data
  • Handling corrupted, missing and uncertain data
  • Noisy data evaluation and cleaning recommendation
  • Syntactic data validations

Submission Instructions

Authors are invited to submit original, previously unpublished research papers of up to 12 pages describing novel research work, including research results and evaluations. Papers should not have been published or submitted for publication concurrently elsewhere.

Papers should be written in English, following the Springer LNCS style, including all text, references, appendices, and figures. Since the review process is single-blind, please include author names and affiliations. For formatting instructions and templates, see the Springer Web page: (LNCS Template Overleaf). Submitted papers will be evaluated by at least three members of the international program committee. At least one author of each accepted paper must register and participate in the workshop to present the paper. The workshop papers will be included in the LNCS/LNAI post-proceedings of the PAKDD Workshops, published by Springer.


Submissions should be made via the Easychair system through the submission page available here:

Authors should consult Springer’s authors’ guidelines and use their proceedings templates, either for LaTeX or for Word, for the preparation of their papers. Springer encourages authors to include their ORCIDs in their papers. In addition, the corresponding author of each paper, acting on behalf of all of the authors of that paper, must complete and sign a Consent-to-Publish form. The corresponding author signing the copyright form should match the corresponding author marked on the paper. Once the files have been sent to Springer, changes relating to the authorship of the papers cannot be made.

The submitted papers must not be previously published anywhere and must not be under consideration by any other conference or journal during the data-datareadiness2021 review process.

Guidelines for Video Presentation of Accepted Papers

Refer to this doc for creating and submitting the video presentations.


Laure Berti-Equille
Research Director in Computer Science at IRD, the French Institute of Research in Sustainability Science
Keynote talk on Data curation for ML: Toward a Principled Approach

Abstract: Data cleaning and preparation are the first critical tasks that can affect result quality and robustness of machine learning pipelines. This talk will present previous and recent contributions in data curation for AI. Discovering patterns of errors is important because it may change the data pre-processing strategy: from handling anomalies in isolation to handling intricate glitches in a principled way. Multiple types of errors co-exist in training, testing, and validation datasets with various distributions: the presence of one type of glitch can hinder the detection of another type of glitch. Different orderings in the sequence of tasks for cleaning and pre-processing the data may lead to dramatically different pre-processed datasets, and ultimately different ML results. Therefore, it is essential to keep track of and evaluate the candidate data transformation pipelines, provide comparative analysis and explanations to the users, and recommend the optimal data pre-processing strategy. In this line, we have used reinforcement learning and developed Learn2Clean, a system that selects, for a given data set, ML model, and quality performance metric, the optimal sequence of tasks for pre-processing the data such that the quality metric is maximized. Finally, the talk will conclude with a discussion of some challenging research directions at the intersection of machine learning and data management for seamlessly orchestrating automated and Human-in-the-Loop (HIL) tasks for optimal data pre-processing in end-to-end ML pipelines.

Bio: Laure Berti-Equille is a Research Director in Computer Science at IRD, the French Institute of Research in Sustainability Science since 2011. Before that, she was a full Professor at Aix-Marseille University (AMU) in France (2017-2018). From 2014 to 2017, she was a Senior Scientist at Qatar Computing Research Institute (Hamad Bin Khalifa University), a computer science research institute of Qatar Foundation. From 2000 to 2010, she was a tenured Associate Professor at University of Rennes 1 in France, and spent two years as a visiting researcher at AT&T Labs Research in New Jersey, USA, as a recipient of the prestigious European Marie Curie Outgoing Fellowship (2007-2009). Her research work is at the intersection of large-scale data analytics and machine learning, with a focus on data quality and applied research, with many industry collaborations, more than 80 publications and three monographs. She has organized several scientific workshops in conjunction with top-tier conferences such as SIGMOD and VLDB and has given many tutorials and keynote talks (KDD, CIKM, ICDE, ICDM). Laure serves as an associate editor of various scientific journals: VLDB Journal, ACM Journal on Data and Information Quality, and Frontiers in Big Data Science, and has served on many conference program committees (VLDB, SIGMOD, ICDE). She has received various grants from the French Agency for National Research (ANR), the French National Research Council (CNRS), and the European Union.

Abir De
Assistant Professor at Department of Computer Science and Engineering, Indian Institute of Technology Bombay
Invited talk on Machine Learning with Human in Loop

Abstract: Decisions are increasingly taken by both humans and machine learning models. However, machine learning models are currently trained for full automation; they are not aware that some of the decisions may still be taken by humans. In this talk, we tackle two problems towards making machine learning models aware of the presence of human decision-makers. We introduce the convex learning problem under human assistance and show that it is NP-hard. Then, we derive an alternative representation of the corresponding objective function as a difference of non-decreasing submodular functions. Building on this representation, we further show that the objective is non-decreasing and satisfies α-submodularity, a recently introduced notion of approximate submodularity. These properties allow simple and efficient greedy algorithms to enjoy approximation guarantees in solving the problem. Experiments on synthetic and real-world data from two important applications, medical diagnosis and content moderation, demonstrate that the greedy algorithm beats several competitive baselines.

Bio: Abir De is an assistant professor in the CSE Department at IIT Bombay. Prior to this, he was a postdoctoral researcher at the Max Planck Institute for Software Systems in Kaiserslautern, Germany, from January 2018. He received his PhD from the Department of Computer Science and Engineering, IIT Kharagpur in July 2018. During that time, he was a part of the Complex Network Research Group (CNeRG) at IIT Kharagpur. His PhD thesis received the INAE best PhD thesis award. He was supported by the Google India PhD Fellowship 2013. Prior to that, he completed his BTech in Electrical Engineering and MTech in Control Systems Engineering, both at IIT Kharagpur. His main research interests broadly lie in modeling, learning and control of networked dynamical processes. Very recently, he started working on human-centric machine learning. His publications can be accessed from here.

Organizing Committee

Program Committee

  • Shanmukha C Guttula, IBM Research
  • Aniya Aggarwal, IBM Research
  • Pranay Lohia, IBM Research
  • Vitobha Munigala, IBM Research
  • Ruhi Sharma Mittal, IBM Research
  • Lokesh N, IIT-B
  • Naveen Panwar, IBM Research
  • Kishalay Das, Indian Institute of Science
  • Vishal Saley, Indian Institute of Science
  • Arushi Prakash, Amazon
  • Paarth Gupta, SMVDU

Accepted Papers

Paper: Cooperative Monitoring of Malicious Activity in Stock Exchanges
Authors: Bhavya Kalra, Sai Krishna Munnangi, Kushal Majmundar, Naresh Manwani and Praveen Paruchuri
Abstract: Stock exchanges are marketplaces to buy and sell securities such as stocks, bonds and commodities. Due to their prominence, stock exchanges are prone to a variety of attacks, which can be classified as external and internal attacks. Internal attacks, which are the specific focus of this paper, aim to make profits by manipulating trading processes, e.g., spoofing, quote stuffing, layering and others. Different types of proprietary fraudulent activity detectors are deployed by stock exchanges to analyze the time series data of traders' activities or the activity of a particular stock to flag potentially malicious transactions, while human analysts probe the flagged transactions further. The key issue faced here is that while the number of anomalous transactions identified can run into thousands or tens of thousands, the number of such transactions that can realistically be probed by human analysts would be a small fraction due to resource constraints. The issue therefore reduces to a dynamic resource allocation problem wherein alerts that represent the most malicious transactions need to be mapped to human analysts for further probing across different time intervals. To address this challenge, we encode the scenario as a Cooperative Target Observation (CTO) problem wherein the analysts (modeled as observers) perform a cooperative observation of alerts that represent potentially malicious activity (modeled as targets) and develop multiple solution approaches in order to identify malicious activity.

Paper: Data-Debugging through Interactive Visual Explanations
Authors: Shazia Afzal, Arunima Chaudhary, Nitin Gupta, Hima Patel, Carolina Spina and Dakuo Wang
Abstract: Data readiness analysis consists of methods that profile data and flag quality issues to determine the AI readiness of a given dataset. Such methods are being increasingly used to understand, inspect and correct anomalies in data such that their impact on downstream machine learning is limited. This often requires a human in the loop for validation and application of remedial actions. In this paper we describe a tool to assist data workers in this task by providing rich explanations to results obtained through data readiness analysis. The aim is to allow interactive visual inspection and debugging of data issues to enhance interpretability as well as facilitate informed remediation actions by humans in the loop.

Paper: Data Augmentation for Fairness in Personal Knowledge Base Population
Authors: Lingraj S Vannur, Balaji Ganesan, Lokesh Nagalapatti, Hima Patel and Thippeswamy Mn
Abstract: Cold start knowledge base population (KBP) is the problem of populating a knowledge base from unstructured documents. While neural networks have led to improvements in the different tasks that are part of KBP, the overall F1 of the end-to-end system remains quite low. This problem is more acute in personal knowledge bases, which present additional challenges with regard to data protection, fairness and privacy. In this work, we use data augmentation to populate a more complete personal knowledge base from the TACRED dataset. We then use explainability techniques and representative set sampling to show that the augmented knowledge base is more fair and diverse as well.

Program Schedule

The workshop is scheduled for May 11, 2021, from 14:00 to 17:30 IST.

Time | Program | Title
14:00-14:05 | Opening Remarks |
14:05-15:05 | Keynote speech by Dr. Laure Berti-Equille | Data curation for ML: Toward a Principled Approach
15:05-15:15 | Networking with peers |
15:15-15:35 | Paper Presentation | Cooperative Monitoring of Malicious Activity in Stock Exchanges, Bhavya Kalra
15:35-15:55 | Paper Presentation | Data-Debugging through Interactive Visual Explanations, Shazia Afzal
15:55-16:15 | Paper Presentation | Data Augmentation for Fairness in Personal Knowledge Base Population, Lingraj S Vannur
16:15-16:25 | Break |
16:25-17:25 | Invited Talk by Prof. Abir De | Machine Learning with Human in Loop
17:25-17:30 | Concluding Remarks |

Workshop Proceedings

The workshop proceedings are now available.


Contact Information

For any queries reach out to us at