[Getting Started with Python] Web Crawler Project Walkthrough

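1. HTML Crash Course

HTML stands for HyperText Markup Language.

The first line of the document, <!DOCTYPE html>, declares it to be an HTML document. The root tag is html, which contains head and body: head holds metadata about the page, and body holds the content to be rendered.

<meta charset="UTF-8"> declares that the character encoding is UTF-8.

The front-end technology stack:

html

css: Cascading Style Sheets

js: JavaScript

Tree relationships between nodes: ancestor, parent, child, sibling, descendant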

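2. XPath

/: selects from the root node

//: selects matching nodes anywhere in the document below the current node, regardless of their position

.: selects the current node

..: selects the parent of the current node

@: selects attributes

Examples:

/html: selects the root element html

body/div: selects all div elements that are children of body

//div: selects all div elements, wherever they are in the HTML document

@lang: selects all attributes named lang

Wildcards

*: matches any element node

@*: matches any attribute node

Examples:

//*: selects all elements in the document

//title[@*]: selects all title elements that have at least one attribute

|: in a path expression, | combines two node sets (a union). For example, //body/div | //body/li selects all div elements and all li elements that are children of body

//div | //li: selects all div and li elements in the document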
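The path expressions above can be tried out without installing anything. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which implements only a subset of XPath: it has no | operator, and attribute values are read with .get() rather than selected with @lang. The sample document and its tag contents are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed sample document (not taken from the article).
SAMPLE = """
<html>
  <head><title lang="en">Demo</title></head>
  <body>
    <div id="a"><p>first</p></div>
    <div id="b"><ul><li>item</li></ul></div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# body/div : div elements that are direct children of body
body_divs = root.findall("body/div")

# //div : every div in the document; ElementTree spells this ".//div"
all_divs = root.findall(".//div")

# //* : every element below the root
all_elements = root.findall(".//*")

# //title[@lang] : title elements that carry a lang attribute
titles_with_lang = root.findall(".//title[@lang]")

# .. : the parent of a matched node (here, the parent of <li>)
li_parents = root.findall(".//li/..")

# //div | //li : no union operator here, so concatenate two queries
divs_and_lis = root.findall(".//div") + root.findall(".//li")

print([d.get("id") for d in body_divs])
print(len(all_elements))
print(titles_with_lang[0].get("lang"))
print([p.tag for p in li_parents])
print([e.tag for e in divs_and_lis])
```

Full XPath 1.0 support, including | and @attr selection, needs a library such as lxml or the selectors built into Scrapy, which are used later in this walkthrough.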

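3. Introduction to BeautifulSoup

What is BeautifulSoup?
It is a Python library for extracting data from HTML or XML files.
Install command: pip install beautifulsoup4

Run the command above in PyCharm's Terminal window to install it.

Here I opened a random stock on Tonghuashun (同花顺), went to Company Profile -> Executives, used F12 to locate the corresponding HTML document, and saved it locally as 000004.html.

Then write the parsing code: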

'''
What is BeautifulSoup?
A Python library for extracting data from HTML or XML files.
pip install beautifulsoup4
'''
from bs4 import BeautifulSoup

html_doc = "E:/Python/PythonStudy/000004.html"
# The saved page is GBK-encoded, so open it with that encoding.
with open(html_doc, "r", encoding="gbk") as html_file:
    html_handle = html_file.read()
soup = BeautifulSoup(html_handle, 'html.parser')
print(soup)

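Running this prints the parsed HTML document.

Note that the downloaded document is GBK-encoded. Forcing encoding="utf-8" when opening it raises an error (a UnicodeDecodeError).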
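The encoding problem can be reproduced without the saved page. This is a small sketch in which a made-up string stands in for the downloaded file: the bytes are produced with GBK, then decoded once with the right codec and once with the wrong one.

```python
# Made-up sample text standing in for the downloaded page.
text = "<p>高管介绍</p>"
raw = text.encode("gbk")  # the bytes as they would sit on disk

decoded = raw.decode("gbk")  # the correct codec round-trips cleanly

try:
    raw.decode("utf-8")  # the wrong codec fails on these bytes
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print("UTF-8 decoding failed:", exc.reason)

print(decoded)
```

open(..., encoding="utf-8") followed by read() performs the same decoding step, which is why the wrong encoding argument fails on this file.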

4. Using Selectors in BeautifulSoup

'''
What is BeautifulSoup?
A Python library for extracting data from HTML or XML files.
pip install beautifulsoup4
'''
from bs4 import BeautifulSoup
import re

html_doc = "E:/Python/PythonStudy/000004.html"
with open(html_doc, "r", encoding="gbk") as html_file:
    html_handle = html_file.read()
soup = BeautifulSoup(html_handle, 'html.parser')
#print(soup)

# Get the document head
#print(soup.head)

# Get the first matching node in the document
print(soup.p)

# Get the attributes of that node
print(soup.p.attrs)

# Get all matching nodes
ps = soup.find_all("p")
#print(ps)

# Locate by id
result = soup.find_all(id="quotedata")
print(result)

# Search by CSS class
jobs = soup.find_all("td", class_="jobs")
print(jobs)

names = soup.find_all("a", class_="turnto")
print(names)

# Pull the 2-5 character link texts (the names) out with a regular expression
r = re.findall(r">(.{2,5})</a>", str(names))
print(r)
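To see what the final regular expression does on its own, here is a minimal sketch that runs the same pattern over a hand-written string. The markup only imitates what str(names) might look like; the names and href values are invented.

```python
import re

# Made-up markup imitating str(names) from find_all("a", class_="turnto").
html = '[<a class="turnto" href="#1">张三</a>, <a class="turnto" href="#2">欧阳小明</a>]'

# ">(.{2,5})</a>" captures 2 to 5 characters sitting between ">" and "</a>".
pattern = re.compile(r">(.{2,5})</a>")
found = pattern.findall(html)
print(found)
```

The pattern silently skips any link text shorter than 2 or longer than 5 characters. Calling get_text() on each tag returned by find_all avoids that limit and needs no regular expression.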

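5. Scrapy Basic Environment

Run pip install scrapy in PyCharm to install Scrapy. On my machine this failed with error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools". Searching online turned up two fixes. One is to follow the error message and download Microsoft Visual C++ 14.0 from the official site. The other is to download the Twisted wheel that matches your setup from https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted, for example Twisted-18.9.0-cp37-cp37m-win_amd64.whl (the number after cp is the Python version, here 3.7, and amd64 means 64-bit), and then run pip install E:\Python\Twisted-18.9.0-cp37-cp37m-win_amd64.whl (use the path where you saved the downloaded file).

After installing the manually downloaded Twisted, run pip install scrapy again to install Scrapy.

To verify that Scrapy is installed, open the Tonghuashun A-share stock page at http://stockpage.10jqka.com.cn/, enter any stock code such as 600004, and choose Company Profile, which gives http://stockpage.10jqka.com.cn/600004/company/#detail.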

import scrapy
html = scrapy.Request("http://stockpage.10jqka.com.cn/600004/company/#detail")
print(html)

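If the request object prints without an import error, Scrapy is installed.

How do Scrapy and BeautifulSoup differ? BeautifulSoup is a library, a third-party Python package, while Scrapy is a framework. The difference is this: you take a library and call it directly from your own project code, whereas a framework does much of the work for you automatically and calls the pieces you fill in.

Creating the project (for example with scrapy startproject stock_spider) produces a stock_spider directory under part6.

The right way to continue from here is to open this project in PyCharm.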

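6. How Scrapy Is Used

Go into the stock_spider directory and create a spider with: scrapy genspider tonghuashun http://stockpage.10jqka.com.cn/600004/company/#detail

This kept failing with a message that scrapy is not recognized as an internal or external command, so I had to reinstall Scrapy under my Python installation directory.

Then run the spider-creation command again: scrapy genspider tonghuashun http://stockpage.10jqka.com.cn/600004/company/#detail

Open tonghuashun.py.

Go to http://stockpage.10jqka.com.cn/600004/company/#detail, press F12, locate the element, then right-click -> Copy -> Copy XPath to get the path of the node to be parsed.

The content of tonghuashun.py is as follows: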

# -*- coding: utf-8 -*-
import scrapy

class TonghuashunSpider(scrapy.Spider):
    name = 'tonghuashun'
    allowed_domains = ['stockpage.10jqka.com.cn']
    start_urls = ['http://stockpage.10jqka.com.cn/600004/company/#detail/']

    def parse(self, response):
        # XPath copied from the browser: //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a
        res_selector = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a")
        print(res_selector)

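For testing, create main.py under stock_spider as an entry point for running and debugging the spider.

Running it failed with "ModuleNotFoundError: No module named 'win32api'", so I went to the Python installation directory and ran pip install pywin32.

After that, main.py ran but produced no output.

Analyzing the page with F12 showed that the data actually comes from http://basic.10jqka.com.cn/600004/company.html; the XPath itself was correct.

Dynamic page: the page content is fetched from a database or another source and then rendered.

Static page: what you see is what you get; the HTML that is served already contains the content.

The modified tonghuashun.py looks like this: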

# -*- coding: utf-8 -*-
import scrapy

class TonghuashunSpider(scrapy.Spider):
    name = 'tonghuashun'
    # Must match the host in start_urls, otherwise requests may be filtered as off-site
    allowed_domains = ['basic.10jqka.com.cn']
    #start_urls = ['http://stockpage.10jqka.com.cn/600004/company/#detail/']
    start_urls = ['http://basic.10jqka.com.cn/600004/company.html']

    def parse(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/a
        res_selector = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a")
        print(res_selector)

'''
Dynamic page: the content is fetched from a database or another source and then rendered
Static page: what you see is what you get
'''

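Running it prints the matched selector.

Of course, what we really want is the text inside the a tag. To get it, append /text() to the XPath expression.

Then res_selector.extract() returns the text values as a list of strings.

7. Locating Elements

Debugging the XPath this way is slow. Scrapy provides the scrapy shell command, which shows immediately whether an element was located, so it can be used to debug XPath expressions before writing the spider.

Run scrapy shell http://basic.10jqka.com.cn/600004/company.html, then enter response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract() to check whether the element can be located.

Next, work out how to locate all the directors:

Then put this code into the project:

Now look at another example: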

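8. The Crawler

Crawling Tonghuashun too quickly might get our local IP banned by the site, and since we are only here to learn, the rest of this walkthrough practices on http://pycs.greedyai.com/ instead.

First, create a spider.

Then change stock.py to the following: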

# -*- coding: utf-8 -*-
import scrapy
from urllib import parse

class StockSpider(scrapy.Spider):
    name = 'stock'
    # A bare domain name: no scheme and no trailing slash
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        # Collect every link on the start page and request each one
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        print("callback invoked")

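url=parse.urljoin(response.url, post_url): builds the full URL. If post_url is already absolute it is used as is; if it is relative it is resolved against response.url.

callback=self.parse_detail: the function Scrapy calls to parse the response to this request.

dont_filter=True: tells Scrapy not to filter this request out, so it bypasses the duplicate-request filter and the allowed_domains (off-site) check.

yield: hands each request to Scrapy for processing. It is similar to return, except that the function keeps running, so parse can emit one request per link.

Then modify main.py and run it.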
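The URL joining and the yield behaviour described above can be tried in isolation with the standard library. In this sketch, build_requests is a hypothetical helper that yields plain URL strings where the spider yields scrapy.Request objects, and the link list is invented.

```python
from urllib.parse import urljoin

def build_requests(base_url, hrefs):
    """Yield one absolute URL per href, the way parse() yields one Request per link."""
    for href in hrefs:
        # Absolute hrefs are kept as they are; relative ones are resolved against base_url.
        yield urljoin(base_url, href)

# Hypothetical links, not taken from the real site.
base = "http://pycs.greedyai.com/"
links = ["company/600004.html", "/company/000004.html", "http://example.com/other.html"]

urls = list(build_requests(base, links))
for u in urls:
    print(u)
```

Because build_requests is a generator, the loop body resumes after each yield. A return in the same position would stop after the first link.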

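9. Locating Page Elements

10. Processing the Scraped Data

Pipelines are where the scraped data gets processed. For data to reach a pipeline, its fields must first be defined in items.py.

Splitting the editor window makes it easy to define the fields while looking at the spider.

Once the fields are defined, populate them in stock.py.

After all of that, the pipeline still has to be enabled in settings.py.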
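In Scrapy, pipelines are enabled through the ITEM_PIPELINES setting. The fragment below is a sketch of what that entry might look like here: the dotted path assumes the project package is stock_spider and the pipeline class is StockPipeline, and 300 is an arbitrary order value (by convention these fall between 0 and 1000, with lower values running first).

```python
# settings.py (fragment)
# Assumed path: <project package>.pipelines.<pipeline class>
ITEM_PIPELINES = {
    "stock_spider.pipelines.StockPipeline": 300,
}
```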

Now set a breakpoint in pipelines.py and run main.py in debug mode; you can see that all the data has been captured.

The relevant code at this point is:

stock.py

# -*- coding: utf-8 -*-
import scrapy
import re
from urllib import parse
from stock_spider.items import StockItem

class StockSpider(scrapy.Spider):
    name = 'stock'
    # A bare domain name: no scheme and no trailing slash
    allowed_domains = ['pycs.greedyai.com']
    start_urls = ['http://pycs.greedyai.com/']

    def parse(self, response):
        post_urls = response.xpath("//a/@href").extract()
        for post_url in post_urls:
            yield scrapy.Request(url=parse.urljoin(response.url, post_url), callback=self.parse_detail, dont_filter=True)

    def parse_detail(self, response):
        stock_item = StockItem()
        # Names of the board members
        stock_item["names"] = self.get_tc(response)
        # Sex
        stock_item["sexes"] = self.get_sex(response)
        # Age
        stock_item["ages"] = self.get_age(response)
        # Stock code
        stock_item["codes"] = self.get_code(response)
        # Position
        stock_item["leaders"] = self.get_leader(response, len(stock_item["names"]))

        # File-storage logic could be written here, but Scrapy expects it in pipelines.
        # For a pipeline to receive the data, the fields must be defined in items.py.
        yield stock_item

    def get_tc(self, response):
        tc_names = response.xpath("//*[@id=\"ml_001\"]/table/tbody/tr[1]/td[1]/a/text()").extract()
        return tc_names

    def get_sex(self, response):
        # //*[@id="ml_001"]/table/tbody/tr[1]/td[1]/div/table/thead/tr[2]/td[1]
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        sex_list = []
        for info in infos:
            try:
                # A character class needs no "|": [男女] matches either character
                sex = re.findall(r"[男女]", info)[0]
                sex_list.append(sex)
            except IndexError:
                continue
        return sex_list

    def get_age(self, response):
        infos = response.xpath("//*[@class=\"intro\"]/text()").extract()
        age_list = []
        for info in infos:
            try:
                # The first run of digits in the intro text is taken as the age
                age = re.findall(r"\d+", info)[0]
                age_list.append(age)
            except IndexError:
                continue
        return age_list

    def get_code(self, response):
        infos = response.xpath('/html/body/div[3]/div[1]/div[2]/div[1]/h1/a/@title').extract()
        code_list = []
        for info in infos:
            code = re.findall(r"\d+", info)[0]
            code_list.append(code)
        return code_list

    def get_leader(self, response, length):
        tc_leaders = response.xpath("//*[@class=\"tl\"]/text()").extract()
        # Keep only as many positions as there are names
        tc_leaders = tc_leaders[0:length]
        return tc_leaders
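One thing to watch in get_sex and get_age: an intro string with no match is skipped, so the resulting lists can end up shorter than names and no longer line up by position. The sketch below shows one possible alternative that keeps one entry per intro string. parse_intro is a hypothetical helper, and the sample strings only assume a rough "sex, age, education" layout.

```python
import re

def parse_intro(intro):
    """Return (sex, age) from one intro string; a missing part becomes ""."""
    sex = re.search(r"[男女]", intro)
    age = re.search(r"\d+", intro)
    return (sex.group() if sex else "", age.group() if age else "")

# Made-up intro strings in the assumed layout.
intros = ["男 52 硕士", "女 本科", "简历暂无"]
rows = [parse_intro(s) for s in intros]
print(rows)
```

Every input produces exactly one (sex, age) pair, so the result always has the same length as the list of intro strings.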

　　items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class StockSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class StockItem(scrapy.Item):
    names = scrapy.Field()
    sexes = scrapy.Field()
    ages = scrapy.Field()
    codes = scrapy.Field()
    leaders = scrapy.Field()

　　pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item

class StockPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

11. Data Processing

The data-processing code lives in pipelines.py: it takes the scraped data and writes it to a file in a fixed format.

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os

class StockSpiderPipeline(object):
    def process_item(self, item, spider):
        return item

class StockPipeline(object):

    def __init__(self):
        # Open the output file once, when the pipeline object is created
        self.file = open("executive_prep.csv", "a+")

    def process_item(self, item, spider):
        # If the file is still empty, write the header row first:
        # executive name, sex, age, stock code, position
        if os.path.getsize("executive_prep.csv") == 0:
            self.file.write("高管姓名,性别,年龄,股票代码,职位\n")
        # Always append the rows for this item, including the very first one
        self.write_content(item)
        self.file.flush()
        return item

    def write_content(self, item):
        names = item["names"]
        sexes = item["sexes"]
        ages = item["ages"]
        codes = item["codes"]
        leaders = item["leaders"]

        # One CSV line per executive; the lists are matched up by position
        for i in range(len(names)):
            result = names[i] + "," + sexes[i] + "," + ages[i] + "," + codes[i] + "," + leaders[i] + "\n"
            self.file.write(result)
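The same "write the header once, then append rows" idea can also be expressed with Python's csv module, which takes care of quoting if a field ever contains a comma. This is a standalone sketch rather than a drop-in pipeline: append_rows is a hypothetical helper, the item is a plain dict with made-up values, and the file is written to a temporary directory.

```python
import csv
import os
import tempfile

HEADER = ["高管姓名", "性别", "年龄", "股票代码", "职位"]

def append_rows(path, item):
    """Append one row per executive; write the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(HEADER)
        # zip() walks the parallel lists together and stops at the shortest one.
        for row in zip(item["names"], item["sexes"], item["ages"],
                       item["codes"], item["leaders"]):
            writer.writerow(row)

# Made-up item with the same keys as StockItem.
item = {"names": ["张三", "李四"], "sexes": ["男", "女"], "ages": ["52", "47"],
        "codes": ["600004", "600004"], "leaders": ["董事长", "董事"]}

path = os.path.join(tempfile.mkdtemp(), "executive_prep.csv")
append_rows(path, item)
append_rows(path, item)  # the second call must not repeat the header

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(len(rows))
```

Using zip() instead of indexing means a short list truncates the output rather than raising IndexError, which may or may not be what you want for incomplete records.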

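Running main.py generates executive_prep.csv.

So far we have covered a basic crawler built on the Scrapy framework: how each Scrapy module works, from fetching the data, to processing it, to persisting it (to a file or to a database). As beginners we write to a file first, which is enough to tie the whole flow together.

The data can also be stored in a database such as Neo4j, but it must first be converted to the .csv format that Neo4j's import tool requires.

Neo4j download: https://neo4j.com/download-center/#panel2-3. After downloading and unpacking, start the service with bin/neo4j start. The initial username/password is neo4j/neo4j; change the password when prompted.

For the file format Neo4j requires, see https://neo4j.com/docs/operations-manual/current/tutorial/import-tool/

Suppose we have already crawled the related data into CSV files and converted them to the CSV format Neo4j requires. All of it can then be imported into Neo4j with the following command: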

bin/neo4j-admin import --nodes executive.csv --nodes stock.csv --nodes concept.csv --nodes industry.csv --relationships executive_stock.csv --relationships stock_industry.csv --relationships stock_concept.csv
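For orientation, the import tool expects node files whose header marks an :ID column (and optionally :LABEL), and relationship files whose header has :START_ID, :END_ID and :TYPE. The sketch below builds three tiny files in memory to show that shape. The file roles follow the command above, but the column names (personId, stockId, name, code), the sample rows, and the relationship type WORKS_FOR are all hypothetical; check the linked Neo4j documentation for the options your version supports.

```python
import csv
import io

def to_csv(header, rows):
    """Render a header plus rows as CSV text (kept in memory for illustration)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Node files: one column is marked :ID, and :LABEL names the node label.
executive_csv = to_csv(["personId:ID", "name", ":LABEL"],
                       [["p1", "张三", "Executive"]])
stock_csv = to_csv(["stockId:ID", "code", ":LABEL"],
                   [["s600004", "600004", "Stock"]])

# Relationship file: start node id, end node id, relationship type.
executive_stock_csv = to_csv([":START_ID", ":END_ID", ":TYPE"],
                             [["p1", "s600004", "WORKS_FOR"]])

print(executive_csv)
print(stock_csv)
print(executive_stock_csv)
```

The ids in the relationship file must match ids that appear in the node files, which is why the sample uses distinct prefixes (p, s) for the two node kinds.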

By default the data is stored in the graph.db folder. If graph.db already contains data, delete it before running the command.

After restarting the Neo4j service, the knowledge graph can be viewed at localhost:7474.

Course link: https://ke.qq.com/webcourse/index.html#cid=320330&term_id=100380209&taid=2576310363022154&vid=x1428ovixbh

Article link: https://www.cnblogs.com/flyingeagle/articles/10665555.html

[Getting Started with Python] Web Crawler Project Walkthrough - bijian1013
