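Abstract: Crowdsourcing is an application model that has emerged on the Internet in recent years. It offers a new route to data labeling for traditional supervised learning. To infer the true label of each sample from the multiple noisy labels supplied by crowdsourcing workers, researchers have proposed a variety of ground truth inference algorithms. Evaluating these algorithms has therefore become a pressing problem. This thesis designs and implements a simulator for crowdsourced labeling to evaluate the performance of ground truth inference algorithms. The simulator comprises three main parts: sample generation, labeling simulation, and algorithm evaluation. Sample generation produces samples with different class distributions, together with their true labels, according to user settings. Labeling simulation selects labelers of different types and in different numbers, according to user settings, to label the generated samples. Algorithm evaluation selects different ground truth inference algorithms, runs them on the generated labeling data, and obtains their performance metrics. The simulator is developed in Java and runs across platforms. Experiments show that the simulated data it generates are fully usable for evaluating the performance of ground truth inference algorithms.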
Keywords: crowdsourced labeling; ground truth inference algorithms; simulated labeling; algorithm evaluation
Graduation Project Report: Foreign-Language Abstract
Title: Design and Implementation of a Crowdsourcing Labeling Simulator
Abstract: Crowdsourcing is a novel application model that has emerged on the Internet in recent years. It provides a new way to annotate data for traditional supervised learning. To infer the ground truth from the multiple noisy labels provided by crowdsourced workers, researchers have proposed a variety of ground truth inference algorithms. Evaluating these algorithms has thus become a pressing challenge. To address this issue, this thesis designs and implements a simulator for crowdsourcing labeling that evaluates the performance of ground truth inference algorithms. The simulator contains three main components: sample generation, labeling simulation, and algorithm evaluation. Sample generation produces samples with different class distributions, together with their true labels, according to user settings. Labeling simulation selects workers of different types and in different numbers, according to user settings, to label the generated samples. Once the simulated data have been generated, users can choose different algorithms to run on them, and the performance of each algorithm is evaluated in this way. The simulator is implemented in Java and can run on different operating systems. Experimental results show that the data generated by the simulator can be used to evaluate the performance of ground truth inference algorithms.
Keywords: crowdsourcing labeling; ground truth inference; labeling simulation; algorithm evaluation
Undergraduate Graduation Project Report
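Contents
1 Introduction
1.1 The Concept of Crowdsourcing
1.2 Current State of Crowdsourced Labeling
1.3 Purpose and Significance of the Design
1.4 Thesis Structure
1.5 Chapter Summary
2 A Survey of Research on Ground Truth Inference and Evaluation for Crowdsourced Labeling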