Data Repurposability

1. Ogunseye, S., Parsons, J., & Lukyanenko, R. (2020). To Train or Not to Train? How Training Affects the Diversity of Crowdsourced Data. Forty-First International Conference on Information Systems (ICIS), India.

Organizations and individuals who use crowdsourcing to collect data prefer knowledgeable contributors. They train recruited contributors, expecting them to provide better quality data than untrained contributors. However, selective attention theory suggests that, as people learn the characteristics of a thing, they focus on only those characteristics needed to identify the thing, ignoring others. In observational crowdsourcing, selective attention might reduce data diversity, limiting opportunities to repurpose and make discoveries from the data. We examine how training affects the diversity of data in a citizen science experiment. Contributors, divided into explicitly and implicitly trained groups and an untrained (control) group, reported artificial insect sightings in a simulated crowdsourcing task. We found that trained contributors reported less diverse data than untrained contributors, and explicit (rule-based) training resulted in less diverse data than implicit (exemplar-based) training. We conclude by discussing implications for designing observational crowdsourcing systems to promote data repurposability.

2. Ogunseye, S., Parsons, J., & Lukyanenko, R. (2020). Crowdsourcing for Repurposable Data: What We Lose When We Train Our Crowds. AIS SIGSAND.

Users of crowdsourced data expect that knowledge of the domain of a data crowdsourcing task will positively affect the data that their contributors provide, so they train potential participants on the crowdsourcing task to be performed. We carried out an experiment to test how training affects data quality and data repurposability – the capacity for data to flexibly accommodate both anticipated and unanticipated uses. Eighty-four contributors, trained explicitly (using rules), trained implicitly (using exemplars), or left untrained, reported sightings of artificial insects and other entities in a simulated citizen science project. We find no information quality or data repurposability advantages to training contributors. Trained contributors reported fewer differentiating attributes and fewer total attributes of the entities they observed, and are therefore less likely to report data that can lead to discoveries. We discuss the implications of our findings for the design of inclusive data crowdsourcing systems.

3. Ogunseye, S., & Parsons, J. (2018). Designing for Information Quality in the Era of Repurposable Crowdsourced User-Generated Content. International Conference on Advanced Information Systems Engineering (pp. 180-185). Springer, Cham.

Conventional wisdom holds that expert contributors provide higher quality user-generated content (UGC) than novices. Using the cognitive construct of selective attention, we argue that this may not be the case in some crowdsourcing UGC applications. We argue that crowdsourcing systems that seek participation mainly from contributors who are experienced or have high levels of proficiency in the crowdsourcing task will gather less diverse and therefore less repurposable data. We discuss the importance of the information diversity dimension of information quality for the use and repurposing of UGC and provide a theoretical basis for our position, with the goal of stimulating empirical research.

4. Are Frequent Online Reviewers Really Helping? How Experience Affects the Attribute Diversity of Online Reviews. 20th Annual Aldrich Conference, St. John's, NL, Canada.

5. Ogunseye, S., Parsons, J., & Lukyanenko, R. (2017). Do crowds go stale? Exploring the effects of crowd reuse on data diversity. WITS 2017.

Crowdsourcing is increasingly used to engage people to contribute data for a variety of purposes to support decision-making and analysis. A common assumption in many crowdsourcing projects is that experience leads to better contributions. In this research, we demonstrate limits of this assumption. We argue that greater experience in contributing to a crowdsourcing project can lead to a narrowing in the kind of data a contributor provides, causing a decrease in the diversity of data provided. We test this proposition using data from two sources: comments submitted with contributions in a citizen science crowdsourcing project, and three years of online product reviews. Our analysis of comments provided by contributors shows that the length of comments decreases as the number of contributions increases. Also, we find that the number of attributes reported by contributors decreases as they gain experience. These findings support our prediction, suggesting that the diversity of data provided by contributors declines over time.
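As a purely illustrative aside, and not the analysis reported in the paper above: the kind of trend described here – contribution richness shrinking as experience grows – can be quantified with a simple linear regression of comment length on a contributor's running contribution count. The toy records, column names, and use of pandas/scipy below are assumptions made only for this sketch.

# Minimal sketch (assumed data layout, not the paper's datasets or models):
# test whether comment length trends downward as a contributor's running
# contribution count (an experience proxy) increases.
import pandas as pd
from scipy import stats

# Hypothetical per-contribution records: contributor id and free-text comment.
df = pd.DataFrame({
    "contributor": ["a", "a", "a", "b", "b", "b", "b"],
    "comment": ["spotted near the pond, red wings, six spots",
                "red wings, six spots", "red wings",
                "long antennae, striped thorax, on a birch leaf",
                "striped thorax, long antennae", "striped thorax", "striped"],
})

# Experience proxy: the order of each contribution within a contributor's history.
df["experience"] = df.groupby("contributor").cumcount() + 1
df["comment_length"] = df["comment"].str.split().str.len()

# A negative slope is consistent with comment length (one diversity proxy)
# shrinking as contributors gain experience.
slope, intercept, r, p, se = stats.linregress(df["experience"], df["comment_length"])
print(f"slope={slope:.2f}, r={r:.2f}, p={p:.3f}")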

6. Ogunseye, S., & Parsons, J. (2017). What Makes a Good Crowd? Rethinking the Relationship between Recruitment Strategies and Data Quality in Crowdsourcing. In Proceedings of the 16th AIS SIGSAND Symposium (pp. 19-20).

Conventional wisdom dictates that the quality of data collected in a crowdsourcing project is positively related to how knowledgeable the contributors are. Consequently, numerous crowdsourcing projects implement crowd recruitment strategies that reflect this reasoning. In this paper, we explore the effect of crowd recruitment strategies on the quality of crowdsourced data using classification theory. As these strategies are based on knowledge, we consider how a contributor’s knowledge may affect the quality of data he or she provides. We also build on previous research by considering relevant dimensions of data quality beyond accuracy and predict the effects of available recruitment strategies on these dimensions of data quality.

7. Ogunseye, S., & Parsons, J. The Downside of Expertise: Does Domain Knowledge Affect the Quality of Crowdsourced Data? In Proceedings of the 15th AIS SIGSAND Symposium.

Subject matter expertise is widely believed to have a positive effect on information quality in crowdsourcing. Many crowdsourcing systems are therefore designed to seek out contributions from experts in the crowd. We argue that expert contributors of data in crowdsourcing projects are proficient rule-based classifiers, and are efficient because they attend only to attributes of instances that are relevant to a classification task while ignoring attributes irrelevant to the task at hand. We posit that this selective attention will negatively affect the tendency of experts to contribute data outside of categories anticipated in the design of a class-based data crowdsourcing platform. We propose hypotheses derived from this view, and outline two experiments to test them. We conclude by discussing the potential implications of this work for the design of crowdsourcing platforms and the recruitment of expert versus novice data contributors in studies of data quality in crowdsourcing settings.

8. Ogunseye, S., & Parsons, J. (2016). Can expertise impair the quality of crowdsourced data? In SIGOPEN Developmental Workshop at ICIS.

It is not uncommon for projects that collect crowdsourced data to be commissioned with incomplete knowledge of data contributors, data consumers, and/or the purposes for which the collected data will be used. Such unanticipated uses and users of data form the basis for open information environments (OIEs), and the information collected through systems designed to gather content from users has high quality when it is complete, accurate, current, and provided in an appropriate format. However, because it is assumed that experts provide higher quality information, many types of OIEs have been designed for experts. In this paper, we question the appropriateness of this assumption in the context of citizen science systems – an exemplary category of OIE. We begin by arguing that experts are primarily efficient rule-based classifiers, which implies that they selectively focus only on attributes relevant to their classification task and ignore others. Drawing from existing literature, we posit that experts' focus on only diagnostic features of an entity leads to a learned inattention to non-diagnostic attributes. This may improve the accuracy of the information provided, but at the expense of its completeness, currency, format, and ultimately its novelty for unanticipated uses. On the other hand, we predict that non-experts and amateurs may use rules to a lesser extent, resulting in less selective attention and leading them to provide more novel information with less trade-off of one dimension of information quality for another. We propose hypotheses derived from this view and outline two experiments we have designed to test them across four dimensions of information quality. We conclude by discussing the potential implications of this work for the design of crowdsourcing platforms and the recruitment of experts, amateurs, or novice data contributors in studies of data quality in crowdsourcing settings.

9. Afolabi, D., Ogunseye, S., Sennaike, O., & Adewole, P. (2023). Improving Decision Tree Classification with Ramen: A Ratio-Weighted Approach for Imbalanced Datasets.

Imbalanced datasets pose a common challenge in real-world applications, often leading to poor-quality decision tree classifications. While previous approaches have attempted to tackle this problem, they have faced limitations such as overfitting and the loss of useful data, with a significant negative impact on the quality of decisions made through such classifications. In our study, we introduce an optimized ratio-weighted decision tree algorithm designed to address these limitations. Our algorithm takes a unique approach by retaining the minority instances present in the original dataset, avoiding the unnecessary discarding of potentially valuable data. By allowing the classification algorithm to determine the appropriate ratio of majority instances, we enhance the classification of minority samples. The results reveal that our proposed algorithm outperforms traditional decision tree classifiers and surpasses the minority entropy algorithm in identifying more members of the minority class. By effectively handling imbalanced data, our algorithm contributes to more reliable and precise decision-making processes.
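For illustration only, a minimal Python sketch of the general idea described above: keep every minority instance and let validation performance choose how many majority instances to retain before fitting a decision tree. This is not the Ramen algorithm itself; the ratio search, the use of scikit-learn's DecisionTreeClassifier and cross_val_score, and the function name fit_ratio_weighted_tree are assumptions made for this sketch.

# Illustrative sketch only: a ratio-driven baseline for imbalanced decision tree
# classification. NOT the Ramen algorithm; the resampling and ratio-selection
# strategy here are assumptions for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def fit_ratio_weighted_tree(X, y, candidate_ratios=(1.0, 2.0, 3.0, 5.0), seed=0):
    """Keep every minority instance, subsample the majority class at several
    majority:minority ratios, and let cross-validated macro-F1 pick the ratio."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority, majority = classes[np.argmin(counts)], classes[np.argmax(counts)]
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)

    best_score, best_model = -np.inf, None
    for ratio in candidate_ratios:
        # Retain all minority instances; sample majority instances at this ratio.
        n_maj = min(len(maj_idx), int(ratio * len(min_idx)))
        keep = np.concatenate([min_idx, rng.choice(maj_idx, n_maj, replace=False)])
        Xr, yr = X[keep], y[keep]
        tree = DecisionTreeClassifier(random_state=seed)
        score = cross_val_score(tree, Xr, yr, cv=5, scoring="f1_macro").mean()
        if score > best_score:
            best_score, best_model = score, tree.fit(Xr, yr)
    return best_model, best_score


if __name__ == "__main__":
    # Synthetic imbalanced dataset (roughly 9:1 majority:minority).
    X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
    model, score = fit_ratio_weighted_tree(X, y)
    print(f"selected tree cross-validated macro-F1: {score:.3f}")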