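Python Series (1): Using BeautifulSoup
It has been a long time since I updated this blog. I plan to write a series on Python web scraping and data analysis. That said, I shouldn't make promises lightly, or I'll end up eating my words.
Scraping content with Python is the process, analyzing the data is the result, and drawing conclusions is the real goal. A Python crawler usually gets its content from web pages, so how do we extract the information we want from an HTML page? By parsing it. The common tools today are BeautifulSoup, PyQuery, XPath, and regular expressions. Regular expressions are error-prone and have always been a weak spot of mine, so I'll cover the other three, starting today with BeautifulSoup.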
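1. Introduction
BeautifulSoup literally translates to "beautiful soup". It is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document.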
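To make "navigating, searching, and modifying" concrete, here is a minimal sketch. The HTML snippet and the variable names are made up for illustration, and it assumes the built-in html.parser is used as the parser.

```python
from bs4 import BeautifulSoup

# A tiny made-up snippet, only for illustration.
snippet = '<div><p class="intro">Hello, <b>world</b></p></div>'
soup = BeautifulSoup(snippet, features="html.parser")

# Navigate: walk down the tree with attribute access.
bold = soup.div.p.b

# Search: look up a tag by name and CSS class.
intro = soup.find("p", class_="intro")

# Modify: replace the text inside the <b> tag.
bold.string = "soup"

print(intro.get_text())  # Hello, soup
print(soup)              # <div><p class="intro">Hello, <b>soup</b></p></div>
```

All three operations work on the same tree, so the change made through `bold` is visible through `intro` and `soup` as well.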
2. Installation
pip install beautifulsoup4
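After installing, a quick check like the one below can confirm that the package imports and show which parsers are usable. This is only a sketch: the helper name parser_available is hypothetical. The examples in this post use both html.parser (built into Python) and lxml, and lxml is a separate package (pip install lxml).

```python
import bs4
from bs4 import BeautifulSoup, FeatureNotFound

print(bs4.__version__)  # the installed Beautiful Soup version


def parser_available(name):
    """Return True if BeautifulSoup can build a tree with the named parser.

    Hypothetical helper for illustration; not part of bs4.
    """
    try:
        BeautifulSoup("<p>ok</p>", features=name)
        return True
    except FeatureNotFound:
        # Raised when the requested parser library is not installed.
        return False


for name in ("html.parser", "lxml", "html5lib"):
    print(name, parser_available(name))
```

If lxml shows up as unavailable, the lxml-based examples will raise FeatureNotFound until it is installed.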
3. Preparing the test code
This is a passage from Alice in Wonderland (referred to below as the "Alice" document):
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
We will use the markup above as our test input.
4. Usage
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
</html>
"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html_doc, features="html.parser")
# print(soup.prettify())

# Access the first tag of a given name as an attribute of the soup.
print(soup.title)
# <title>The Dormouse's story</title>
print(soup.title.name)
# title
print(soup.title.string)
# The Dormouse's story
print(soup.title.parent.name)
# head
print(soup.p)
# <p class="title"><b>The Dormouse's story</b></p>
print(soup.p['class'])
# ['title']
print(soup.a)
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

# find_all returns every match; find returns only the first one.
print(soup.find_all('a'))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
print(soup.find(id='link3'))
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

# Extract the URL from every <a> tag.
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

# Extract all text from the document (blank lines omitted below).
print(soup.get_text())
# The Dormouse's story
# The Dormouse's story
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
# ...
Each comment above shows the output of the statement before it.
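Building on the calls above, find_all and find also accept filters. The sketch below uses a shortened copy of the "Alice" markup and shows filtering by CSS class, limiting the number of results, and what find returns when nothing matches. The id link4 is deliberately one that does not exist in the document.

```python
from bs4 import BeautifulSoup

# Shortened copy of the "Alice" document, enough for the filters below.
html_doc = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, features="html.parser")

# Filter by tag name and CSS class (class_ avoids the Python keyword).
names = [a.get_text() for a in soup.find_all("a", class_="sister")]
print(names)  # ['Elsie', 'Lacie', 'Tillie']

# Stop after the first two matches.
first_two = soup.find_all("a", limit=2)
print([a["id"] for a in first_two])  # ['link1', 'link2']

# find returns None when nothing matches, so check before using the result.
missing = soup.find("a", id="link4")
print(missing)  # None

# Map each link id to its URL.
links = {a["id"]: a.get("href") for a in soup.find_all("a")}
print(links["link3"])  # http://example.com/tillie
```

Checking for None matters in real scraping code, because pages often lack the element you expect.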
5. BeautifulSoup accepts a string or a file handle
from bs4 import BeautifulSoup

# features="lxml" requires the lxml package (pip install lxml).
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', features="lxml")
tag = soup.b
print(tag)
# <b class="boldest">Extremely bold</b>

# Rename the tag.
tag.name = "blockquote"
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>

# Read attributes like dictionary entries.
print(tag['class'])
# ['boldest']
print(tag.attrs)
# {'class': ['boldest']}

# Add and delete an attribute.
tag['id'] = "stylebs"
print(tag)
# <blockquote class="boldest" id="stylebs">Extremely bold</blockquote>
del tag['id']
print(tag)
# <blockquote class="boldest">Extremely bold</blockquote>

# class is a multi-valued attribute, so it comes back as a list.
css_soup = BeautifulSoup('<p class="body strikeout"></p>', features="lxml")
print(css_soup.p['class'])
# ['body', 'strikeout']

# id is not multi-valued, so it comes back as a single string.
id_soup = BeautifulSoup('<p id="my id"></p>', features="lxml")
print(id_soup.p['id'])
# my id

# rel is multi-valued; a list assigned to it is joined back into one string.
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', features="lxml")
print(rel_soup.a['rel'])
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>
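The code above only passes strings to BeautifulSoup. Here is a minimal sketch of the file-handle case. The file name sample.html and its content are made up for illustration, and the file is written to a temporary directory so the example cleans up after itself.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up markup, only for illustration.
html = '<html><body><p id="greeting">Hello from a file</p></body></html>'

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    # Pass the open file handle directly; BeautifulSoup reads from it.
    with open(path, encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, features="html.parser")

# Parsing the same markup from a string gives an equivalent tree.
soup_from_string = BeautifulSoup(html, features="html.parser")

print(soup_from_file.p.get_text())                   # Hello from a file
print(str(soup_from_file) == str(soup_from_string))  # True
```

The tree is fully built inside the with block, so it remains usable after the file is closed.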
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id40