Performance Analysis for Web Scraping Tools: Case Studies on Beautifulsoup, Scrapy, Htmlunit and Jsoup


DİKİLİTAŞ Y., Çakal Ç., Okumuş A. C., Yalçın H. N., Yıldırım E., Ulusoy Ö. F., ...

International Conference on Emerging Trends and Applications in Artificial Intelligence, ICETAI 2023, İstanbul, Türkiye, 8-9 September 2023, vol. 960, pp. 471-480

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume: 960
  • DOI: 10.1007/978-3-031-56728-5_39
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Pages: pp. 471-480
  • Keywords: Beautifulsoup, HtmlUnit, Jsoup, Scrapy, Web Scraping
  • Kocaeli University Affiliated: Yes

Abstract

Web scraping has become an indispensable technique for extracting valuable data from websites. With the growing demand for efficient and reliable web scraping tools, it is crucial to assess their performance to guide developers and researchers in selecting the most suitable tool for their needs. In this paper, we present a comprehensive performance analysis of four popular web scraping tools: BeautifulSoup, Scrapy, HtmlUnit, and Jsoup. Our study focuses on evaluating these tools based on metrics such as execution time, memory usage, and scalability. We conducted experiments using various websites and datasets to provide a comprehensive evaluation of the tools’ performance. The results highlight the strengths and limitations of each tool, allowing users to make informed decisions when choosing a web scraping tool based on performance requirements. Additionally, we discuss real-world use cases and the impact of website structures on tool performance. This paper aims to assist developers and researchers in selecting the most appropriate web scraping tool for their specific needs, and it also identifies avenues for future research to further enhance the performance of these tools.