Simone Rebora presented this topic during the Digital Social Reading course at the University of Basel and his lecture was moderated by Catherine Coloma and Ramon Sanchez.
Authors: Catherine Coloma and Ramon Erdem-Sanchez
In the ever-evolving landscape of digital research, methodologies are instrumental in unlocking the vast potential of data sourced from online platforms. Researchers, functioning as architects of the digital frontier, explore realms such as social networking sites, web scraping, and dynamic content, shaping the narrative of knowledge acquisition in unprecedented ways. The vastness, diversity, and rapid growth of Big Web Data present challenges for individual researchers and teams alike, making manual collection and organization nearly impossible (Krotov and Tennyson 2018; Krotov and Silva 2018). Consequently, researchers frequently turn to tools and technologies, collectively known as web scraping, to automate the processes involved in collecting and organizing data from the web (Krotov and Tennyson 2018; Krotov and Silva 2018). However, the ethical and legal dimensions of data extraction and analysis impose a critical duty on researchers to navigate the digital terrain with prudence and integrity. It therefore becomes imperative to explore not only the methodologies driving digital research but also the ethical considerations underpinning the responsible use of the data at hand. In his lecture, Rebora (2023) introduced three approaches to web scraping:
Application Programming Interfaces (API)
APIs offer a structured and legal means for digital researchers to access data from social networking sites, ensuring ease of use within the boundaries set by API providers. Despite these advantages, there are limitations: access is confined to the data that website owners choose to make available, and the costs associated with APIs can be prohibitive. Twitter's API exemplifies these issues, with restricted data access and repeated changes to data availability and access methods over time. In contrast, YouTube stands out as a platform that provides accessible data through its API, underscoring how much API quality varies among social networking sites.
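As a small illustration of the API route, the sketch below assembles a request URL for the YouTube Data API v3 commentThreads endpoint, which returns the comments on a video. The API key is a placeholder and the helper function is our own naming; quotas and parameter limits are set by the provider.

```python
from urllib.parse import urlencode

# Placeholder credential: a real key is issued via the Google Cloud Console.
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://www.googleapis.com/youtube/v3/commentThreads"

def build_comments_request(video_id: str, max_results: int = 20) -> str:
    """Build a YouTube Data API v3 URL for fetching a video's comment threads."""
    params = {
        "part": "snippet",          # which resource fields to return
        "videoId": video_id,        # the video whose comments we want
        "maxResults": max_results,  # the API caps this at 100 per page
        "key": API_KEY,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_comments_request("dQw4w9WgXcQ")
# The resulting URL can then be fetched with urllib.request or the
# `requests` library, staying within the provider's quota limits.
```

Because the provider defines the endpoint, the parameters, and the quota, the researcher's code stays simple; the trade-off, as noted above, is that only the data the owner exposes is reachable.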
Static Web Scraping
Static web scraping, an alternative to APIs, involves extracting information from websites using programming languages such as R and Python (Krotov, Johnson & Silva, 2020). It retrieves data embedded directly in a website's HTML code, offering cost-effective access to stable content. However, as the name implies, it applies only to static web pages, and it is more complex than working with an API. A static web page is a basic type of web page that remains the same each time it is loaded or viewed: it does not change based on a visitor's actions and does not display new information from a server.
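The principle can be sketched with only the Python standard library: the data of interest already sits in the page's HTML, so the scraper's job is to parse it out. Real projects typically pair the `requests` library with `BeautifulSoup` (bs4); the sample HTML below stands in for a downloaded page.

```python
from html.parser import HTMLParser

# Stand-in for a page fetched over HTTP.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First headline</h2>
  <p>Some text.</p>
  <h2 class="title">Second headline</h2>
</body></html>
"""

class HeadlineExtractor(HTMLParser):
    """Collect the text of every <h2> element in the page."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2 and data.strip():
            self.headlines.append(data.strip())

parser = HeadlineExtractor()
parser.feed(SAMPLE_HTML)
print(parser.headlines)  # → ['First headline', 'Second headline']
```

If the same page were rendered by JavaScript after loading, this approach would return nothing, which is exactly the limitation the dynamic approach below addresses.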
Dynamic Web Scraping
Dynamic web scraping addresses the limitations of static web scraping by providing access to constantly evolving content, which is particularly relevant for dynamic sites such as social networking platforms. It is the most complex of the three approaches, but it offers almost full coverage of and access to data. It also leaves researchers in far more 'grey areas': there are not yet enough academic studies on the topic for researchers to determine what is legal and ethical when using this approach. This legal ambiguity necessitates a closer examination of the potential consequences and ethical considerations involved.
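In practice, dynamic scraping usually means automating a full browser (with tools such as Selenium or Playwright) so the JavaScript-rendered content becomes visible. A lighter technique, sketched below, is to call the JSON endpoint the page itself queries after loading; the payload here is simulated, since a real endpoint URL and schema would have to be discovered in the browser's network inspector.

```python
import json

# Simulated response from a hypothetical comments endpoint; dynamic pages
# often fetch content like this via JavaScript after the initial HTML loads.
SIMULATED_RESPONSE = """
{
  "comments": [
    {"user": "reader_a", "text": "Great chapter!", "likes": 12},
    {"user": "reader_b", "text": "Slow start, strong finish.", "likes": 5}
  ],
  "next_page_token": "abc123"
}
"""

def extract_comments(raw: str) -> list:
    """Pull the fields of interest out of one page of the JSON feed."""
    payload = json.loads(raw)
    return [
        {"user": c["user"], "text": c["text"]}
        for c in payload["comments"]
    ]

comments = extract_comments(SIMULATED_RESPONSE)
# A real scraper would loop, passing `next_page_token` back to the endpoint
# until no further pages remain.
```

Note that this near-complete access is precisely what raises the 'grey areas' discussed next: the endpoint was built for the site's own front end, not for researchers.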
Web scraping, particularly dynamic scraping, raises critical legal and ethical considerations for researchers. Which jurisdiction is responsible for claims of unfair use, data access, and copyright infringement is a complex question. Cameron Gray's 2021 article highlights the lack of international agreements on internet use and data extraction, leaving decisions to individual cases and their respective jurisdictions. The U.S. Copyright Office supports the idea that fair use is not universally defined, making it a case-by-case determination. Many jurisdictions allow exemptions from standard copyright protections under the "Fair Use Doctrine," protecting researchers for most research purposes. However, the legal ambiguity surrounding web scraping, especially in the absence of enforceable contracts, necessitates a cautious approach. Amidst these legal complexities, the open and communal nature of online interactions on social networking sites remains a valuable resource for researchers, emphasizing the need to navigate these practices with prudence and responsibility.
The ethical dimension of web scraping is equally significant. In this context, 'ethics' refers to the moral principles governing the exchange of goods and services. The paradox of ethics in web scraping is evident in the open nature of the web, which is driven by principles of accessibility. The large amount of user data, while invaluable for businesses, raises ethical concerns regarding its protection and its use for research purposes. Despite the legal theories guiding web scraping, there is a gap in the literature addressing the subtler ethical issues related to this emerging practice. As researchers grapple with the legal and ethical implications, questions arise about whether a universal definition of ethics exists and how it applies across different jurisdictions. The European Union's General Data Protection Regulation (GDPR) serves as an example of comprehensive privacy regulation, highlighting the challenges researchers face when scraping in regions without such laws. Researchers are urged to consider overarching questions addressing both ethical and legal implications, with an emphasis on the potential harm to individuals, organizations, or communities. The expansive datasets of the Big Web offer numerous research opportunities, but the legality and ethics of data collection from the Web remain 'grey areas.' From a personal standpoint, an ethical and legal approach to data collection should be a foundational step for researchers, emphasized from the outset.
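One concrete courtesy that is widely recommended in the scraping literature, though it is no substitute for legal advice, is honouring a site's robots.txt before collecting anything. The rules below are a made-up example; a real check would fetch the file from the target site (e.g. https://example.com/robots.txt).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the site allows crawling everywhere
# except the /private/ section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyResearchBot", "https://example.com/reviews"))    # True
print(rp.can_fetch("MyResearchBot", "https://example.com/private/x"))  # False
```

A robots.txt file is not legally binding in most jurisdictions, but respecting it signals exactly the prudence and responsibility this section argues for.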
As researchers navigate the complexities of web scraping, the intertwined aspects of legal and ethical considerations underscore the need for a balanced and responsible approach. The open nature of online platforms provides valuable resources for researchers, but the lack of international agreements and evolving legal landscapes necessitate constant vigilance. While legal theories guide researchers, ethical nuances demand attention, especially in the absence of a comprehensive framework. The ethical sourcing of information becomes fundamental in web scraping, not as an afterthought but as an integral part of the research process. Researchers must prioritize an ethical and legal approach, ensuring that the pursuit of knowledge through web scraping aligns with principles of responsibility, collaboration, and integrity.
References

Dittmer, J. (n.d.). Applied ethics. Internet Encyclopedia of Philosophy. Retrieved from https://iep.utm.edu/
Gray, C. C. (2021, May 8). Ethical concerns surrounding web scraping & internet data. Bangor University.
Krotov, V., Johnson, L., & Silva, L. (2020). Tutorial: Legality and ethics of web scraping. Communications of the Association for Information Systems, 47. https://doi.org/10.17705/1CAIS.04724
Lancaster University. (n.d.). GDPR: What researchers need to know. Retrieved from https://www.lancaster.ac.uk/research/research-services/research-integrity-ethics--governance/data-protection/gdpr-what-researchers-need-to-know/
Rebora, S. (2023, October 26). Scraping data from social networking sites. Digital Social Reading, University of Basel.
Singer, P. (2023, December 26). Ethics. Encyclopedia Britannica. Retrieved from https://www.britannica.com/topic/ethics-philosophy
Snell, J., & Nicola, M. (2016). Web scraping in an era of big data 2.0. Bloomberg Law News.
U.S. Copyright Office. (n.d.). U.S. Copyright Office fair use index. Retrieved from https://www.copyright.gov/fair-use/