爬取干货集中营的图片

网站地址

http://gank.io/

分析网站

网站地址:https://gank.io/, 要爬取的是网站首页上展示出来的图片

在网站首页底部左上角有API地址,点击进入到干货集中营 API 文档

其中有这样的内容显示,如下图：

尝试访问:https://gank.io/api/data/福利/10/1, 出来如下的json数据：

通过API文档描述,可以知道链接地址中的请求个数和请求页数,获取到所有的图片信息。
比如:https://gank.io/api/data/福利/700/1, 请求700条数据,获取第一页,发现返回的结果中总共有670条数据(截止到20190314),第二页没有数据.
再次请求:https://gank.io/api/data/福利/600/1, 请求600条数据,获取第一页,发现确实有600条数据,
然后再请求第二页(https://gank.io/api/data/福利/600/2),有70条数据。
结合以上分析,截止到今天(20190314),共有670条数据,避免使用翻页的情况,就直接使用如下网址获取全部数据:https://gank.io/api/data/福利/700/1
根据获取到的son数据,提取出所需要的图片链接,最后下载这些图片并保存.

实际代码

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import datetime
import os
from _md5 import md5

import requests
from requests import RequestException

def get_one_page(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
    except:
        return None

def utc_to_local(utc_time_str, utc_format='%Y-%m-%dT%H:%M:%S.%fZ'):
    """
    "2015-06-05T03:54:29.403Z"格式的时间转换成2015-06-05 11:54:29
    """
    local_format = "%Y-%m-%d %H:%M:%S"
    utc_dt = datetime.datetime.strptime(utc_time_str, utc_format)
    local_dt = utc_dt + datetime.timedelta(hours=8)
    time_str = local_dt.strftime(local_format)
    return time_str

def https_to_http(url):
    """
    把图片链接是https的换成http
    """
    if url[0:5] == 'https':
        url = url.replace(url[0:5], 'http')
    return url

def parse_one_page(html):
    items = html['results']
    for item in items:
        yield {
            'id': item['_id'],
            'publishedAt': utc_to_local(item['publishedAt']),
            'url': https_to_http(item['url'])
        }

# 请求图片url,获取图片二进制数据
def download_image(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content)  # response.contenter二进制数据 response.text文本数据
        return None
    except RequestException:
        print('请求图片出错', url)
        return None

def save_image(content):
    """
    需要提前建好目录D:\\pachong\\gank1\\
    """
    file_path = 'D:\\pachong\\gank1\\{1}.{2}'.format(os.getcwd(), md5(content).hexdigest(), 'jpg')
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)

def main():
    url = 'http://gank.io/api/data/福利/700/1'
    html = get_one_page(url)
    for item in parse_one_page(html):
        download_image(item['url'])

if __name__ == '__main__':
    main()

结果分析

总共700条json数据,实际只有564条数据,通过分析获取到的图片url,发现有些图片url链接本身已失效,使用浏览器打开这些图片链接会报如下错误:

失效链接展示:

实际结果：