Py / Python Web Data Scraping and Analysis Class - Notes 8/30

Post author: 莊幸諺
Post published: 2022-09-05
Post category: Python / Course notes

Main content

_Extracting attribute data from a sample web page

_Scraping candidate testimonial images from the TQCPLUS official website

1. Extracting attribute data from a sample web page

from bs4 import BeautifulSoup

# Sample page used by all six exercises below
html = """
<html><head><title>Page Title</title></head>
<h1>Document Heading</h1>
<div class="content">
    <div class="item1">
        <a href="http://example.com/one" class="red" id="link1">First</a>
        <a href="http://example.com/two" class="red" id="link2">Second</a>
    </div>
    <a href="http://example.com/three" class="blue" id="link3">
        <img src="three.jpg">Third
    </a>
</div>
</html>
"""
sp = BeautifulSoup(html, "html.parser")

# Exercises:
# 1. Get <title>Page Title</title>
print("1. Get <title>Page Title</title>")
print(sp.find("title"))

# 2. Get http://example.com/one
print("2. Get http://example.com/one")
print(sp.find("a", id="link1").get("href"))
print(sp.a["href"])  # sp.a is the first <a> tag in the document

# 3. Get every hyperlink whose class is red
print("3. Get every hyperlink whose class is red")
list1 = sp.find_all("a", "red")  # the second argument filters by CSS class
for link in list1:
    print(link.get("href"))

# 4. Get First
print("4. Get First")
print(sp.find("a", id="link1").text)

# 5. Get all of [<title>Page Title</title>, <h1>Document Heading</h1>]
print("5. Get all of [<title>Page Title</title>, <h1>Document Heading</h1>]")
print(sp.find_all(["title", "h1"]))  # a list of names matches any of them

# 6. Get three.jpg
print("6. Get three.jpg")
print(sp.find("img").get("src"))
print(sp.img["src"])

2. Scraping candidate testimonial images from the TQCPLUS official website

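Links are always obtained by combining find_all() and find(), then reading the tag's attribute value: href for an a tag, or src for the img tags in this example.

To save an image file, first use the requests module to fetch the image content,

then use open() to create a binary file for writing, and write the image content into that file: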

f.write(img.content)
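
The save step can be sketched on its own as follows. This is a minimal sketch, not the code from the class: save_bytes is a hypothetical helper, and the placeholder bytes stand in for img.content so the example runs without a network connection. A with block is used so that the file is closed even if the write fails.

```python
import os
import tempfile


def save_bytes(path, content):
    """Write raw bytes to path in binary mode and return the byte count."""
    # "wb" = write + binary; the with block closes the file automatically
    with open(path, "wb") as f:
        return f.write(content)


# Placeholder data standing in for img.content (assumption for this demo)
fake_image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
demo_path = os.path.join(tempfile.mkdtemp(), "demo.png")
written = save_bytes(demo_path, fake_image)
print(written, "bytes written to", demo_path)
```

In the scraper itself, the second argument would be img.content from the requests response.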

import requests
from bs4 import BeautifulSoup

# Download the listing page and parse it
html = requests.get("https://www.tqcplus.org.tw/ExpList.aspx?oc=t&p=1")
sp = BeautifulSoup(html.text, "html.parser")
# The images sit inside the div with class "sp-simpleportfolio-items"
sp1 = sp.find("div", "sp-simpleportfolio-items")
sp2 = sp1.find_all("img")
for i in range(len(sp2)):
    src = sp2[i].get("src")
    img = requests.get(src)  # img.content holds the raw image bytes
    # File name: the text after the last backslash in src (up to index 99)
    name = src[(src.rfind("\\") + 1):99]

    # Save every image under its original file name
    # (the write is left commented out; print shows the name that would be used)
    # f = open(name, "wb")
    print(name)
    # f.write(img.content)
    # f.close()

    # Save into a chosen folder instead
    # f = open("images\\" + name, "wb")
    print("images\\" + name)
    # f.write(img.content)
    # f.close()
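
The folder variant above builds its path with a hard-coded Windows separator and assumes the images folder already exists. A hedged sketch of a more portable version follows. target_path is a hypothetical helper, not part of the original notes, and a temporary directory stands in for the real images folder.

```python
import os
import tempfile


def target_path(folder, name):
    """Return folder/name, creating folder first if it does not exist."""
    os.makedirs(folder, exist_ok=True)  # no error if it is already there
    return os.path.join(folder, name)   # uses the right separator per OS


# Temporary location used only for this demonstration
base = tempfile.mkdtemp()
folder = os.path.join(base, "images")
path = target_path(folder, "photo.jpg")
print(path)
```

The returned path can be passed straight to open(path, "wb").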

sp2[i].get("src")[(sp2[i].get("src").rfind("\\")+1):99]

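get("src")[ ] slices the string returned for src: [0] is the first character, and [0:2] is the first two characters. The end index 99 is an absolute position, so if the string is shorter than 99 characters the slice runs to the end of the string.

rfind() returns the position of the last occurrence of the target string, or -1 if the target is not found. In "\\", the first \ is the escape character, so the pair stands for a single \, and the call returns the position of the last \ in the string.

Combined, the two return the text that follows the last \.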
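
To make the slicing concrete, here is a small sketch on made-up src values (the real values on the site may differ). file_name_from_src is a hypothetical helper that also handles forward slashes, since rfind returns -1 when no backslash is present and the expression from the notes would then return the string from its start.

```python
def file_name_from_src(src):
    """Return the text after the last backslash or slash in src."""
    # rfind returns -1 when the separator is absent, so max() picks
    # whichever separator occurs last (or -1 if neither occurs)
    last = max(src.rfind("\\"), src.rfind("/"))
    return src[last + 1:]


# Made-up example values, for illustration only
backslash_src = "upload\\photos\\a01.jpg"
slash_src = "http://example.com/img/a02.jpg"

# The expression from the notes: text after the last backslash
print(backslash_src[backslash_src.rfind("\\") + 1:99])  # a01.jpg
# With no backslash, rfind gives -1 and the slice starts at 0
print(slash_src.rfind("\\"))                            # -1
print(file_name_from_src(backslash_src))                # a01.jpg
print(file_name_from_src(slash_src))                    # a02.jpg
```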
