This blog post was written by the moderators of Massimo Lusetti's lecture on Machine Learning and its relevance to Digital Social Reading research.
Authors: Georgina Oduba and Leonie Hormes.
Machine Learning is a branch of Artificial Intelligence that focuses on the development of algorithms and statistical models that enable computers to learn and make decisions without being explicitly programmed. Computers are trained on a dataset to recognize patterns, which they can then use to make predictions or take specific actions. The journey from raw data to a trained model involves a series of carefully ordered steps, and each stage shapes what the resulting model can and cannot do.
In Digital Social Reading, machine learning is a tool used to investigate online reading practices. Here, classifiers are trained to identify absorption in online book reviews. The procedure involves training and then testing the model, an approach that becomes especially necessary when large amounts of data are available. Generally, it is useful because it can improve the interpretation of readers' reactions: to what extent were they absorbed as they read the texts? In the seminar demonstration, the scores of 0.23 and 0.35 obtained with the Naive Bayes and Logistic Regression algorithms respectively show that the models are not good at finding absorption where there is evidence of absorption. They are only good at identifying cases of non-absorption.
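To illustrate how a classifier can look acceptable overall and still miss most cases of absorption, here is a minimal Python sketch that computes precision and recall per class. The labels are made up for illustration and are not the seminar's data, and the function name `per_class_scores` is our own assumption rather than part of any library.

```python
from collections import Counter

def per_class_scores(gold, predicted, label):
    """Return (precision, recall) for one label, given gold and predicted labels."""
    pairs = Counter(zip(gold, predicted))
    true_pos = pairs[(label, label)]
    predicted_pos = sum(n for (g, p), n in pairs.items() if p == label)
    actual_pos = sum(n for (g, p), n in pairs.items() if g == label)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical toy data: 4 reviews annotated as absorption, 16 as non-absorption.
gold = ["absorption"] * 4 + ["non-absorption"] * 16
# The imagined model finds only 1 of the 4 absorbed reviews
# and wrongly flags 1 of the 16 non-absorbed ones.
predicted = (["absorption"] * 1 + ["non-absorption"] * 3
             + ["non-absorption"] * 15 + ["absorption"] * 1)

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print(f"accuracy={accuracy:.2f}")
for label in ("absorption", "non-absorption"):
    precision, recall = per_class_scores(gold, predicted, label)
    print(f"{label}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data the overall accuracy is 0.80, yet recall for absorption is only 0.25 while recall for non-absorption is about 0.94. This is the same kind of imbalance described above: the model handles the frequent class well and the rare class badly.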
Machine Learning is invaluable in research for its ability to analyze vast amounts of data, identify complex patterns, and extract meaningful insights. It aids researchers in making predictions, classifying information, and discovering hidden relationships within datasets. Its automation capabilities enhance efficiency and provide a data-driven foundation for decision-making.
While Machine Learning has proven to be a powerful tool, it cannot be fully relied upon without critical consideration. The reliability of machine learning models depends on the quality and representativeness of the training data, the appropriateness of the chosen algorithm, and the thoroughness of the model evaluation process.
The initial step in the machine learning process involves data collection, where diverse information is gathered. This data can span various formats, including text, images, videos, and more. The abundance and quality of the collected data are foundational to the success of subsequent stages.
In our example, human annotation provides the training material for the machine learning model: annotators label examples, and the model is trained on them to discern patterns and make informed decisions. The focus is on guiding the machine by presenting it with numerous examples, enabling it to learn the nuances of the task at hand. In this example, the computer can “only” decide between two labels, absorption and non-absorption.
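How a model learns from annotated examples can be sketched with a very small Naive Bayes classifier, one of the two algorithms mentioned above. This is a simplified illustration and not the code used in the seminar: the four example reviews are invented, the function names `train_naive_bayes` and `classify` are our own, and we assume simple whitespace tokenization with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Count labels and words from a list of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary

def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Prior: how often this label occurs among the annotated examples.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical annotated reviews (invented for illustration).
annotated = [
    ("i was completely lost in the story", "absorption"),
    ("i could not put the book down", "absorption"),
    ("the book arrived late and damaged", "non-absorption"),
    ("the cover is nice but the print is small", "non-absorption"),
]
model = train_naive_bayes(annotated)
print(classify(model, "lost in the story"))
print(classify(model, "the print is small and late"))
```

With only four examples the model merely picks up which words co-occur with which label. A real study needs many more annotated reviews, and a separate test set to check how well the learned patterns generalize.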
Machine Learning is not universally applicable to all research scenarios. Its relevance depends on the nature of the data, the research question, and the specific goals of the study. In cases where human intuition, creativity, or domain expertise is crucial, machine learning may complement but not replace these essential elements.
Machine Learning has several shortcomings, including:
Data Quality Dependence: Models heavily rely on the quality and representativeness of training data.
Interpretability Issues: Some complex models, like neural networks, are perceived as black boxes, making it challenging to interpret their decisions.
Overfitting and Underfitting: Models may perform poorly on new data if they are so complex that they memorize the training data (overfitting) or too simple to capture the underlying pattern (underfitting).
Bias and Fairness: Models can inherit biases present in training data, leading to unfair or discriminatory outcomes.
Methodological failures in machine learning, such as inadequate data preprocessing, biased training data, or improper model evaluation, can significantly affect the interpretation of data and results. These failures may lead to inaccurate predictions, misclassifications, or biased outcomes, undermining the reliability and validity of the research findings.
Machine Learning has the potential to both simplify and complicate research. On the one hand, it simplifies tasks by automating data analysis and pattern recognition, saving time and effort. On the other hand, it introduces complexities in terms of model selection, parameter tuning, and addressing ethical considerations, especially when dealing with sensitive data or biased outcomes. The complexity often shifts from manual data analysis to the intricacies of designing, training, and interpreting machine learning models.
In essence, the impact of machine learning on research depends on its careful integration, consideration of limitations, and alignment with the specific needs and goals of the research endeavor.
The intersection of human reliability and machine reliability raises the question of inter-annotator agreement. The reliability of human annotations is often simply assumed, despite the potential for errors, which is why inter-annotator agreement is measured: it assesses how consistently different annotators label the same items. Two annotators might disagree, and it is not obvious which of them is correct. In such cases the conflicting labels are aggregated to settle on a single label for each item. Once that decision has been made, the resulting labels are used to train the machine.
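Both ideas can be sketched in a few lines of Python. Cohen's kappa is one common measure of agreement between two annotators, and majority voting is one simple way to aggregate labels; neither is confirmed here as the method used in the seminar. The annotations below are invented, and the function names are our own.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

def majority_vote(labels):
    """Pick the most frequent label for one item (ties go to the first label seen)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical annotations of ten reviews: "A" = absorption, "N" = non-absorption.
annotator_a = ["A", "A", "N", "N", "N", "A", "N", "N", "N", "N"]
annotator_b = ["A", "N", "N", "N", "N", "A", "N", "A", "N", "N"]
print(f"kappa={cohen_kappa(annotator_a, annotator_b):.2f}")

# With a third (hypothetical) annotator, disagreements can be settled by majority.
print(majority_vote(["A", "N", "A"]))
```

Here the two annotators agree on 8 of 10 reviews, but because much of that agreement could arise by chance, kappa is only about 0.52. Note that with exactly two annotators a majority vote cannot break a tie, so a third annotator or a discussion between the annotators is needed.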
In conclusion, the machine learning process is a multi-faceted journey that encompasses data collection, human annotation, supervised learning, model testing, and evaluation. Each step contributes to the overall goal of training a model capable of making informed decisions. As technology advances, addressing questions of reliability, incorporating diverse data sources, and refining models will continue to shape the evolving landscape of machine learning.
Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. 3rd Edition Draft. pp. 56-59.
Lendvai, P., Rebora, S., & Kuijpers, M. M. (2020). Identification of Reading Absorption in User-Generated Book Reviews. Proceedings of the 15th Conference on Natural Language Processing (KONVENS 2019).
Lusetti, M. (2023). Taking empirical research on Digital Social Reading to a larger scale with machine learning. Presentation Digital Social Reading Course, University of Basel.