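I. Introduction to Beautiful Soup

Beautiful Soup is a powerful parsing library that uses the structure and attributes of a web page to extract data from it.

It provides functions for navigating, searching, and modifying the parse tree. It converts input documents to Unicode automatically, so you usually do not need to think about the document's encoding. Beautiful Soup relies on an underlying parser to do the actual parsing; a commonly used one is lxml.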
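The parser is selected by the second argument of the BeautifulSoup constructor. The sketch below is only an illustration: it uses Python's built-in "html.parser" so that it runs without lxml installed, and the inline markup is a made-up example rather than the sample file. Passing "lxml" instead should behave the same way for this input when that package is available.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not part of test03.html.
markup = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# "html.parser" ships with Python; "lxml" is a separate install.
soup = BeautifulSoup(markup, "html.parser")

title_text = soup.title.string
paragraph_text = soup.p.string
print(title_text)
print(paragraph_text)
```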

II. Using Beautiful Soup

Sample test file test03.html:

<!DOCTYPE html>
<html>
<head>
 <meta content="text/html;charset=utf-8" http-equiv="content-type" />
 <meta content="IE=Edge" http-equiv="X-UA-Compatible" />
 <meta content="always" name="referrer" />
 <link rel="stylesheet" type="text/css" />
 <title>百度一下，你就知道 </title>
</head>
<body link="#0000cc">
  <div id="wrapper">
 <div id="head">
  <div class="head_wrapper">
 <div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
 </div>
  </div>
 </div>
  </div>
</body>
</html>

1. Node selectors

As we saw earlier, a web page is made up of element nodes, and extracting the content of a particular node gives you the data shown on the page. Node selectors simplify this: they let you pick out data precisely without writing regular expressions.

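from bs4 import BeautifulSoup

# Read the sample page as bytes and let Beautiful Soup detect the encoding
with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.head)        # the <head> node
print(soup.head.title)  # the <title> node nested inside <head>
print(soup.a)           # the first <a> node in the document

[Output]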

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>
<title>百度一下，你就知道 </title>
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>

Analysis:

The first print gets the page's head node.

The second print gets the title node inside the head node. It uses nested selection, because the title node is nested inside the head node.

The third print gets an a node. The source contains several a nodes, but only the first one is returned. When several nodes match, this style of selection returns only the first match and ignores the rest.
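A related point is what happens when no node matches at all: attribute-style selection then returns None instead of raising an error. The following is a minimal sketch on made-up inline markup (not the sample file), using the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = '<div><a href="/one">one</a><a href="/two">two</a></div>'
soup = BeautifulSoup(markup, "html.parser")

first_link = soup.a   # only the first <a> is returned
missing = soup.table  # there is no <table>, so this is None

print(first_link["href"])
print(missing is None)

# Guard before chaining; soup.table.tr would raise AttributeError here.
row = soup.table.tr if soup.table is not None else None
```

Checking for None before chaining further selections avoids an AttributeError on pages where a node is absent.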

2. Extracting information

The data we need usually lives in a node's name, its attribute values, or its text. The following code shows how to read each of the three:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.body.name)              # node name
print(soup.body.a.attrs["class"])  # class attribute, returned as a list
print(soup.body.a.attrs["href"])   # href attribute
print(soup.body.a.string)          # text content

[Output]

body
['mnav']
http://news.baidu.com
新聞

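Analysis:

The first print gets the name of the body node.

The second print gets the class attribute of the a node. It is returned as a list because class can hold several values.

The third print gets the href attribute of the a node.

The fourth print gets the text of the a node.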
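Attribute access can be sketched a little further. Indexing a tag with a missing attribute raises KeyError, while get() returns None, which tends to be safer on real pages. The markup below is a hypothetical variation of the sample's first link (the extra "top" class is invented to show the list behaviour), parsed with the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Hypothetical variation of the sample's first link; "top" is invented.
markup = '<a class="mnav top" href="http://news.baidu.com" name="tj_trnews">News </a>'
soup = BeautifulSoup(markup, "html.parser")
link = soup.a

classes = link["class"]      # multi-valued attribute: a list of class names
href = link.get("href")      # same value as link["href"]
target = link.get("target")  # missing attribute: None instead of KeyError
text = link.string.strip()   # text without the trailing space

print(classes, href, target, text)
```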

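3. Related-node selection

(1) Child and descendant nodes

Child nodes are available through the contents and children attributes, and descendant nodes through the descendants attribute. contents returns a list, children returns an iterator over the same nodes, and descendants returns a generator, so all three can be walked with a for loop.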

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# print(soup.body.contents)
# Number each direct child of <body>, including whitespace text nodes
for i, content in enumerate(soup.body.contents):
    print(i, content)

[Output]

0

1 <div id="wrapper">
<div id="head">
<div class="head_wrapper">
<div id="u1">
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
</div>
</div>
</div>
</div>
2
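To make the difference between the three attributes concrete, here is a small sketch on made-up markup without whitespace between tags, parsed with the built-in "html.parser" (an assumption made so that lxml is not needed). It shows that contents and children cover only the direct children, while descendants walks the whole subtree.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup for illustration.
markup = "<body><div><p>one</p><p>two</p></div></body>"
soup = BeautifulSoup(markup, "html.parser")
body = soup.body

direct = body.contents              # a list of direct children
same_nodes = list(body.children)    # the same nodes, produced lazily
# descendants also yields text nodes, so keep only Tag objects here
all_names = [node.name for node in body.descendants if isinstance(node, Tag)]

print(len(direct), all_names)
```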

(2) Parent and ancestor nodes

The parent attribute returns a node's parent. For example, to get the parent of the title node in the sample:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.title.parent)  # the <head> node that contains <title>

[Output]

<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<meta content="always" name="referrer"/>
<link rel="stylesheet" type="text/css"/>
<title>百度一下，你就知道 </title>
</head>

Similarly, to get a node's ancestors, use the parents attribute, which yields each ancestor in turn from the parent up to the document root.
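As a rough illustration of parents, the sketch below collects the names of all ancestors of a title node. The markup is a made-up minimal page and the parser is the built-in "html.parser"; the last entry is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

# Hypothetical minimal page for illustration.
markup = "<html><head><title>demo</title></head><body></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# parents is a generator, so it can be consumed by a comprehension
ancestor_names = [parent.name for parent in soup.title.parents]
print(ancestor_names)
```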

(3) Sibling nodes

next_sibling returns the node's next sibling;

previous_sibling returns the node's previous sibling;

next_siblings returns a generator over all siblings that follow the node;

previous_siblings returns a generator over all siblings that precede the node.
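The four sibling attributes can be tried out as follows. This is a sketch on made-up markup that deliberately has no whitespace between the tags; in a file such as test03.html, where each a tag sits on its own line, the immediate sibling of a tag is usually a whitespace text node rather than the next tag. The name values "first", "second", and "third" are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with no whitespace between sibling tags.
markup = '<div><a name="first">1</a><a name="second">2</a><a name="third">3</a></div>'
soup = BeautifulSoup(markup, "html.parser")

# The name keyword of find() means the tag name, so the HTML
# attribute called "name" has to go through attrs.
second = soup.find("a", attrs={"name": "second"})
third = soup.find("a", attrs={"name": "third"})

prev_name = second.previous_sibling["name"]
next_name = second.next_sibling["name"]
following = [tag["name"] for tag in soup.a.next_siblings]
preceding = [tag["name"] for tag in third.previous_siblings]  # nearest first

print(prev_name, next_name, following, preceding)
```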

4. Method selectors

find_all()

Finds all elements that match the given conditions. Its signature is:

find_all(name, attrs, recursive, text, limit, **kwargs)

(1) name

Selects elements by node name. For example, to find the a tags in the sample:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.find_all(name="a"))  # the whole result list
for a in soup.find_all(name="a"):
    print(a)                    # each matching tag in turn

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>

(2) attrs

A query can also filter on tag attributes. The attrs parameter takes a dictionary.

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# Keep only the <a> tags whose class attribute contains "bri"
print(soup.find_all(name="a", attrs={"class": "bri"}))

[Output]

[<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]

As you can see, adding the class="bri" condition narrows the result to a single a tag.

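(3) text

The text parameter matches against a node's text. It accepts a string or a compiled regular expression object. Beautiful Soup 4.4.0 and later call this parameter string; text is the older name.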

import re
from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
# Keep only the <a> tags whose text matches the regular expression
print(soup.find_all(name="a", text=re.compile("新聞")))

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>]

The result contains only the a tag whose text contains "新聞".
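The difference between passing a plain string and a regular expression matters here, because the link text in the sample ends with a space. The sketch below, on made-up English markup, assumes Beautiful Soup 4.4.0 or later (for the string keyword) and uses the built-in "html.parser". A plain string has to equal the tag's text exactly, whereas a compiled pattern only needs to be found somewhere in it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; note the trailing space inside each link.
markup = '<div><a href="/news">News </a><a href="/map">Map </a></div>'
soup = BeautifulSoup(markup, "html.parser")

exact_hits = soup.find_all("a", string="News")              # "News" != "News "
regex_hits = soup.find_all("a", string=re.compile("News"))  # substring search

print(len(exact_hits), len(regex_hits))
```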

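find()

find() works the same way as find_all(). The only difference is that find() matches only the first element found and returns that single element, whereas find_all() matches every element that meets the conditions and returns a list.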
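The contrast between the two methods, including what each returns when nothing matches, can be sketched like this. The list markup and the class name "x" are invented for the example, and the built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
markup = '<ul><li class="x">a</li><li>b</li><li class="x">c</li></ul>'
soup = BeautifulSoup(markup, "html.parser")

first_item = soup.find("li")                # a single Tag
all_items = soup.find_all("li")             # a list of every match
limited = soup.find_all("li", limit=2)      # stop after two matches
by_class = soup.find_all("li", class_="x")  # class_ avoids the Python keyword

no_tag = soup.find("table")                 # None when nothing matches
no_tags = soup.find_all("table")            # empty list when nothing matches

print(first_item.string, len(all_items), len(limited), len(by_class))
```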

5. CSS selectors

To use CSS selectors, call the select() method and pass in the selector.

For example, to get the a tags in the sample with a CSS selector:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
print(soup.select("a"))  # the whole result list
for a in soup.select("a"):
    print(a)             # each matching tag in turn

[Output]

[<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>, <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>, <a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>, <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>, <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>, <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>]
<a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞 </a>
<a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123 </a>
<a class="mnav" href="http://map.baidu.com" name="tj_trmap">地圖 </a>
<a class="mnav" href="http://v.baidu.com" name="tj_trvideo">視頻 </a>
<a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">貼吧 </a>
<a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">更多產品 </a>
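Selectors can be more specific than a bare tag name. The sketch below combines an id, a class, and an attribute selector on a shortened, English-text copy of the sample's link block, and also shows select_one(), which returns the first match or None. It assumes a Beautiful Soup version that bundles the Soup Sieve selector engine (4.7 or later) and uses the built-in "html.parser".

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample's link block, with English link text.
markup = (
    '<div id="u1">'
    '<a class="mnav" href="http://news.baidu.com" name="tj_trnews">News</a>'
    '<a class="mnav" href="http://map.baidu.com" name="tj_trmap">Map</a>'
    '<a class="bri" href="//www.baidu.com/more/" name="tj_briicon">More</a>'
    '</div>'
)
soup = BeautifulSoup(markup, "html.parser")

mnav_links = soup.select("#u1 a.mnav")             # descendant + class
map_link = soup.select_one('a[name="tj_trmap"]')   # attribute selector
no_match = soup.select_one("a.does-not-exist")     # None when nothing matches

print(len(mnav_links), map_link["href"], no_match)
```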

Getting attributes

To get the href attribute of the a tags above:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
for a in soup.select("a"):
    print(a["href"])  # raises KeyError if a tag has no href

[Output]

http://news.baidu.com
https://www.hao123.com
http://map.baidu.com
http://v.baidu.com
http://tieba.baidu.com
//www.baidu.com/more/
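The last value in the output is a protocol-relative URL, so a crawler usually has to resolve it against the address of the page before requesting it. One possible approach is sketched below with urljoin from the standard library. BASE_URL is an assumption, since the sample file does not record where the page was fetched from, and the selector a[href] is used to skip any a tag without an href.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "https://www.baidu.com/"  # assumed address of the page

# Hypothetical markup: an absolute link, a protocol-relative link,
# and an anchor without href.
markup = (
    '<a href="http://news.baidu.com">News</a>'
    '<a href="//www.baidu.com/more/">More</a>'
    '<a name="nohref">x</a>'
)
soup = BeautifulSoup(markup, "html.parser")

# a[href] matches only <a> tags that actually carry an href attribute
absolute_urls = [urljoin(BASE_URL, a["href"]) for a in soup.select("a[href]")]
print(absolute_urls)
```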

Getting text

To get the text of the a tags above, use the get_text() method or the string attribute:

from bs4 import BeautifulSoup

with open("./test03.html", "rb") as file:
    html = file.read()
soup = BeautifulSoup(html, "lxml")
for a in soup.select("a"):
    print(a.get_text())  # all text inside the tag
    print(a.string)      # the tag's single text child

[Output]

新聞
新聞
hao123
hao123
地圖
地圖
視頻
視頻
貼吧
貼吧
更多產品
更多產品
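The two approaches give the same result in the sample because every a tag holds a single piece of text. They differ once a tag contains nested tags: string then returns None, while get_text() still joins all the text. A small sketch on made-up markup, parsed with the built-in "html.parser":

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link contains a nested <b> tag.
markup = "<p><a>News <b>hot</b></a><a>Map </a></p>"
soup = BeautifulSoup(markup, "html.parser")
nested, simple = soup.find_all("a")

nested_string = nested.string                     # None: more than one child
nested_text = nested.get_text()                   # text of all descendants
nested_clean = nested.get_text(" ", strip=True)   # strip pieces, join with " "
simple_string = simple.string                     # single text child
simple_clean = simple.get_text(strip=True)        # trailing space removed

print(nested_string, repr(nested_text), repr(simple_clean))
```

When the text will be stored or compared, get_text(strip=True) is often the more convenient choice, since it also removes the trailing space seen in the output above.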
