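Python Web Scraping Practice 1

Project 001: Scraping a local HTML file

Requirement: extract the titles and categories of articles rated higher than 4

Solution steps:

Step 1: Parse the page with BeautifulSoup

Step 2: Find the tags that are needed

Step 3: Extract the useful information from those tags

The code is as follows: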

from bs4 import BeautifulSoup

data = []
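path = '1_2/1_2code_of_video/web/new_index.html'  # relative paths resolve against the working directory, which in VS Code is the project folder rather than this file's folder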

# Prefer "with open(...) as f" over calling open() and close() by hand:
# the with block closes the file automatically, even if an error occurs.
with open(path, 'r') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

    # Either select() or find_all() would work here
    titles = soup.select('ul > li > div.article-info > h3 > a')
    pics = soup.select('ul > li > img')
    descs = soup.select('ul > li > div.article-info > p.description')
    rates = soup.select('ul > li > div.rate > span')
    cates = soup.select('ul > li > div.article-info > p.meta-info')


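# The value of one tag (the rating) decides whether the values of the other tags are printed,
# so the fields of each article are grouped into a dict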
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        'cate': list(cate.stripped_strings)
    }
    data.append(info)

for i in data:
    # Compare the numeric rating; this assumes the span text is a number such as '4.5'
    if float(i['rate']) > 4:
        print(i['title'], i['cate'])
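
The pairing-and-filtering pattern used above can be tried without any HTML. The following is a minimal sketch with made-up records (the titles, ratings, and categories are hypothetical and not taken from new_index.html), assuming each rating is a numeric string such as '4.5'. It also shows why comparing the numeric value is safer than comparing the length of the string.

```python
# Hypothetical data standing in for the text extracted from the tags.
titles = ['Article A', 'Article B', 'Article C']
rates = ['4.5', '3.0', '5.0']
cates = [['fun'], ['sea'], ['bay', 'boat']]

# zip() walks the lists in parallel; each dict keeps related values together.
data = [
    {'title': t, 'rate': r, 'cate': c}
    for t, r, c in zip(titles, rates, cates)
]

# Filter on the numeric value of the rating, not on the length of its text.
high_rated = [d['title'] for d in data if float(d['rate']) > 4]
print(high_rated)

# A length check cannot tell '3.0' from '4.5': both strings have 3 characters.
by_length = [d['title'] for d in data if len(d['rate']) >= 3]
print(by_length)
```

Note that zip() stops at the shortest list, so if one selector matches fewer tags than the others, the trailing articles are silently dropped.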

Final output:

Sardinia's top 10 beaches ['fun', 'Wow']
How to get tanned ['butt', 'NSFW']
How to be an Aussie beach bum ['sea']
Summer's cheat sheet ['bay', 'boat', 'beach']
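
The script's comment about with open() can be checked directly. This is a minimal sketch that uses a temporary file as a stand-in for new_index.html (the file contents and variable names are illustrative only): the with block closes the file on exit, even when an exception is raised inside it.

```python
import os
import tempfile

# Create a small throwaway file (a stand-in for the real HTML file).
fd, tmp_path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'w') as tmp:
    tmp.write('<p>hello</p>')

# Normal case: the file is closed as soon as the with block ends.
with open(tmp_path, 'r') as f:
    content = f.read()
closed_after_with = f.closed

# Error case: the file is still closed even though the block raised.
try:
    with open(tmp_path, 'r') as g:
        raise ValueError('simulated failure while parsing')
except ValueError:
    pass
closed_after_error = g.closed

os.remove(tmp_path)
print(content, closed_after_with, closed_after_error)
```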

Other notes

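1.

Beautiful Soup transforms a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment. (From the official documentation.)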
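
The four kinds of objects can be seen in a tiny example. This is only a sketch: the HTML string is made up, and it uses the built-in html.parser instead of lxml so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny hypothetical document: one tag holding a comment and a text node.
html = '<p class="note"><!-- hidden -->visible text</p>'
soup = BeautifulSoup(html, 'html.parser')  # html.parser ships with Python

tag = soup.p                    # the <p> element is a Tag
comment, text = tag.contents    # its children: a Comment, then a NavigableString

kinds = [type(obj).__name__ for obj in (soup, tag, text, comment)]
print(kinds)
```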

2. select() and find_all()

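# return type of find_all()
<class 'bs4.element.ResultSet'>
# return type of select() (as observed here; this can differ between bs4 versions)
<class 'list'>

Both return a list-like collection of Tag objects; ResultSet is itself a subclass of list.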
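
A short comparison of the two calls, as a sketch on made-up markup loosely modelled on the selectors used in the script. The exact type that select() returns depends on the installed bs4 version, so the example prints the type names rather than assuming them.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the contents of new_index.html.
html = '''
<ul>
  <li><div class="rate"><span>4.5</span></div></li>
  <li><div class="rate"><span>3.0</span></div></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

by_css = soup.select('ul > li > div.rate > span')   # match by CSS selector
by_name = soup.find_all('span')                     # match by tag name

# Both results support len(), indexing, and iteration like a list.
print(type(by_css).__name__, type(by_name).__name__)
print([tag.get_text() for tag in by_css])
print(isinstance(by_css, list), isinstance(by_name, list))
```

select() is convenient when the position in the tree matters (as with the 'ul > li > ...' paths in the script), while find_all() is simpler when a tag name or attribute alone identifies the elements.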

References

[1] The difference between with open() as f and open()

[2] VS Code file path issues