Web Crawler or Scraper

Citation

[@seleniumWriteYour2022] : official website example

[@allenSelenium4New2021] : comparison of Selenium 3 and Selenium 4

[@tailemiWebScraper2021] : web scraper tutorial

Abstract

website -> (crawl/scrape -> unstructured data -> parse -> (database) -> presentation)

Can the database be shifted left in this pipeline? If so, it could (1) analyze the unstructured data directly (with date and metadata, of course); (2) even crawl/scrape data automatically.

I happened to find an example: [ImportFromWeb Web scraping in Google Sheets - Google Workspace Marketplace](https://workspace.google.com/marketplace/app/importfromweb_web_scraping_in_google_she/278587576794)

It uses an Excel-style spreadsheet as the front end. A data crawler scrapes the website, so the data can be updated automatically.

Why shift left? (1) Raw data is kept for future analysis/verification; (2) the evolution of the data over dates or a time sequence can be presented; (3) missing data can be pursued proactively (active search).
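A minimal sketch of the left-shift idea, assuming SQLite as the store: each raw page is saved together with its URL and fetch time before any parsing, so it can be re-analyzed later or compared over time. The table name `raw_page`, the helper names, and the example URL are hypothetical and chosen only for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_raw_store(path=":memory:"):
    """Open (or create) a store that keeps raw pages plus their metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS raw_page (
               id INTEGER PRIMARY KEY,
               url TEXT NOT NULL,
               fetched_at TEXT NOT NULL,
               body TEXT NOT NULL
           )"""
    )
    return conn


def save_snapshot(conn, url, body, fetched_at=None):
    """Store the unparsed page body with its URL and fetch timestamp."""
    ts = fetched_at or datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO raw_page (url, fetched_at, body) VALUES (?, ?, ?)",
        (url, ts, body),
    )
    conn.commit()


def history(conn, url):
    """Return all snapshots of one URL in time order."""
    return conn.execute(
        "SELECT fetched_at, body FROM raw_page WHERE url = ? ORDER BY fetched_at",
        (url,),
    ).fetchall()


# Illustrative data only: two snapshots of the same (hypothetical) page.
conn = open_raw_store()
url = "https://example.com/price"
save_snapshot(conn, url, "<span>100</span>", "2022-01-01T00:00:00+00:00")
save_snapshot(conn, url, "<span>120</span>", "2022-02-01T00:00:00+00:00")
for fetched_at, body in history(conn, url):
    print(fetched_at, body)
```

Because the raw body is kept, the parsing step can be rerun or corrected later without crawling the site again.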

Introduction

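In the world of AI, data is king. Where does data come from? (1) A public dataset that someone has already curated, or a private dataset that is purchased or collected; (2) data crawled or scraped from the Internet and then organized.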

Crawling or scraping is the first step; organizing is the second. This article focuses on the first step.

Analysis:

selenium (Python) 3.x or 4.x

scraper (GUI)

Organizing: BeautifulSoup
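As a small illustration of the second step with BeautifulSoup, the sketch below turns a scraped HTML fragment into structured records. The HTML snippet, the CSS classes, and the `parse_items` helper are made up for this example; a real page would need its own selectors.

```python
from bs4 import BeautifulSoup

# Illustrative HTML only; stands in for the raw output of the crawl step.
RAW_HTML = """
<ul id="articles">
  <li class="item"><a href="/post/1">First post</a><span class="date">2022-01-01</span></li>
  <li class="item"><a href="/post/2">Second post</a><span class="date">2022-02-01</span></li>
</ul>
"""


def parse_items(html):
    """Extract title, link, and date from each list item."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for li in soup.select("li.item"):
        link = li.find("a")
        date = li.find("span", class_="date")
        records.append(
            {
                "title": link.get_text(strip=True),
                "href": link["href"],
                "date": date.get_text(strip=True),
            }
        )
    return records


records = parse_items(RAW_HTML)
for record in records:
    print(record)
```

The resulting list of dictionaries is the structured form that can be written to a database or a spreadsheet.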

Reference