
Python: Recording a GitHub Repo's Stars with the Scrapy Framework

2020-04-14
nIceLee



Sure, you could hand-roll this without any framework, but that wouldn't serve the goal of getting to know Scrapy.
Let's set a small goal first, then learn through actual use.

Preface

  • Goal
    Once a day, record the accounts of the users who have starred a given GitHub repo
    (BilibiliDown is the example here), and upload the list to a given path
    (GithubStargazers/BilibiliDown).

  • Implementation steps

    • Get the star data
    • Upload the file to GitHub
    • Invoke it periodically via a GitHub workflow

Getting the Star Data

  • Customize items.py
    We only record a serial number and the username, so this is simple.
    import scrapy

    class GithubstarerItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      # serial number
      serial_number = scrapy.Field()
      # username
      user_name = scrapy.Field()
    
  • Customize the spider logic. GitHub already provides a ready-made API endpoint that returns at most 30 entries per page, so we only need to vary the paging parameter ?page=%d and stop once a page comes back empty (a runnable end-to-end sketch follows this list).
    import json

    import scrapy

    from GithubStarer.items import GithubstarerItem

    class BilibilidownSpider(scrapy.Spider):
      name = 'BilibiliDown'
      allowed_domains = ['api.github.com']
      start_urls = ['https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers']
      page = 1

      def parse(self, response):
          print(response.text)
          result = json.loads(response.text)
          # an empty page means we have walked past the last stargazer
          if len(result) == 0:
             return

          i = self.page*30 - 30
          for i_user in result:
              starer = GithubstarerItem()
              starer['serial_number'] = i
              starer['user_name'] = i_user['login']
              i += 1
              print(starer)
              yield starer

          # request the next page
          self.page += 1
          next_link = 'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers?page=%d'%self.page
          yield scrapy.Request(next_link, callback=self.parse)
    
  • Customize pipelines.py to save the collected items to a txt file.
    import os

    class GithubstarerPipeline(object):
      def process_item(self, item, spider):
          # the current working directory
          base_dir = os.getcwd()
          filename = base_dir + '/starers.txt'
          # open the file in append mode and write out this item
          with open(filename, 'a') as f:
              f.write(str(item['serial_number']) + '\t')
              f.write(item['user_name'] + '  \r\n')
          return item
    

    The corresponding configuration in settings.py:

    ITEM_PIPELINES = {
      'GithubStarer.pipelines.GithubstarerPipeline': 300,
    }
    
  • Other details
    The difficulty here is low, so we don't bother much with HTTP headers and the like; settings.py and middlewares.py are left almost untouched.
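
Putting the pieces above together, here is a minimal single-file sketch of the whole crawl, run in-process through Scrapy's CrawlerProcess instead of the scrapy crawl CLI. Registering the pipeline through custom_settings with a '__main__' dotted path is an assumption made purely so the demo is self-contained; in the real project the classes live in items.py and pipelines.py as shown above.

import json
import os

import scrapy
from scrapy.crawler import CrawlerProcess


class GithubstarerItem(scrapy.Item):
    serial_number = scrapy.Field()  # serial number
    user_name = scrapy.Field()      # username


class GithubstarerPipeline(object):
    def process_item(self, item, spider):
        # append each stargazer to starers.txt in the working directory
        filename = os.path.join(os.getcwd(), 'starers.txt')
        with open(filename, 'a') as f:
            f.write('%d\t%s\r\n' % (item['serial_number'], item['user_name']))
        return item


class BilibilidownSpider(scrapy.Spider):
    name = 'BilibiliDown'
    allowed_domains = ['api.github.com']
    start_urls = ['https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers']
    page = 1
    # demo-only assumption: wire up the pipeline without a full Scrapy project
    custom_settings = {
        'ITEM_PIPELINES': {'__main__.GithubstarerPipeline': 300},
    }

    def parse(self, response):
        result = json.loads(response.text)
        if len(result) == 0:  # an empty page means we are done
            return
        i = self.page * 30 - 30
        for i_user in result:
            starer = GithubstarerItem()
            starer['serial_number'] = i
            starer['user_name'] = i_user['login']
            i += 1
            yield starer
        # request the next page
        self.page += 1
        yield scrapy.Request(
            'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers?page=%d' % self.page,
            callback=self.parse)


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(BilibilidownSpider)
    process.start()  # blocks until the crawl finishes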

Uploading the File to GitHub

A ready-made implementation already exists; see FileUploader4Github.
The reference script invocation looks like this:

# format today's date
cur_date=$(date "+%Y-%m-%d")
# upload path
upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
# invoke the existing implementation to upload starers.txt
java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
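
If you'd rather not depend on the jar, the upload itself is a single call to GitHub's contents API: PUT /repos/{owner}/{repo}/contents/{path} with the file's content base64-encoded in the JSON body. Below is a minimal Python sketch of that call using the requests library; the helper name upload_to_github is made up for illustration, and it only covers creating a new file (updating an existing one additionally requires the file's current sha).

import base64
import datetime

import requests


def upload_to_github(repo, path, local_file, token):
    """Create `path` in `repo` from `local_file` via the GitHub contents API."""
    with open(local_file, 'rb') as f:
        content = base64.b64encode(f.read()).decode('ascii')
    resp = requests.put(
        'https://api.github.com/repos/%s/contents/%s' % (repo, path),
        headers={'Authorization': 'token %s' % token},
        json={'message': 'daily stargazer snapshot', 'content': content},
    )
    resp.raise_for_status()  # 201 Created on success


if __name__ == '__main__':
    cur_date = datetime.date.today().strftime('%Y-%m-%d')
    upload_to_github('nICEnnnnnnnLee/GithubStargazers',
                     'BilibiliDown/%s.txt' % cur_date,
                     'starers.txt',
                     'YOUR_AUTH_TOKEN')  # placeholder, not a real token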

Periodic Invocation via a GitHub Workflow

There's nothing difficult here. The one thing to watch is that AUTH_TOKEN must be configured under Settings -> Secrets; it is the token that authorizes the file upload to the target repo.

name: CI

on:
  schedule:
    - cron: '1 0 * * *' # run at 00:01 (UTC) every day

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    # check out the project
    - uses: actions/checkout@v2
    # set up a Java environment
    - name: Set up JDK 1.8
      uses: actions/setup-java@v1
      with:
        java-version: 1.8
    # set up a Python environment
    - name: Set up Python 3.8
      uses: actions/setup-python@v1
      with:
        python-version: 3.8
    # install scrapy
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install scrapy
    # fetch the stargazers and save them to starers.txt
    - name: Get GithubStargazers
      run: |
        rm -rf starers.txt
        scrapy crawl BilibiliDown
    # upload starers.txt to the target repo
    - name: Upload GithubStargazers
      run: |
        cur_date=$(date "+%Y-%m-%d")
        upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
        echo $upload_path
        java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
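
To sanity-check that a day's snapshot actually landed, the dated file can be read back through the same contents API. A small sketch, again assuming the repo layout above (the fetch_snapshot helper is illustrative, not part of the original tooling):

import base64
import datetime

import requests


def fetch_snapshot(repo, path):
    """Fetch a file via the GitHub contents API and return its decoded text."""
    resp = requests.get('https://api.github.com/repos/%s/contents/%s' % (repo, path))
    resp.raise_for_status()
    # the API returns the file body as base64 in the 'content' field
    return base64.b64decode(resp.json()['content']).decode('utf-8')


if __name__ == '__main__':
    cur_date = datetime.date.today().strftime('%Y-%m-%d')
    text = fetch_snapshot('nICEnnnnnnnLee/GithubStargazers', 'BilibiliDown/%s.txt' % cur_date)
    print('%d stargazers recorded today' % len(text.splitlines()))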
