
Python: Recording a GitHub Repo's Stars with the Scrapy Framework

2020-04-14
nIceLee



Sure, you could hand-roll this without any framework, but that wouldn't serve the goal of getting to know Scrapy.
Let's set a small goal first, then learn through actual use.

Preface

  • Goal
    Once a day, record the accounts of the users who have starred a given GitHub repo
    (BilibiliDown is the example here), and upload the list to a given path
    (GithubStargazers/BilibiliDown).

  • Implementation steps

    • Get the star data
    • Upload the file to GitHub
    • Invoke it periodically via a GitHub workflow

Getting the Star Data

  • Customize items.py
    We only record a serial number and the username, so this is simple.
    import scrapy

    class GithubstarerItem(scrapy.Item):
      # define the fields for your item here like:
      # name = scrapy.Field()
      # serial number
      serial_number = scrapy.Field()
      # username
      user_name = scrapy.Field()
    
  • Customize the spider logic. GitHub already provides a ready-made API endpoint that returns at most 30 entries per page, so we only need to vary the paging parameter ?page=%d and stop once a page comes back empty (a runnable end-to-end sketch follows this list).
    import json

    import scrapy

    from GithubStarer.items import GithubstarerItem

    class BilibilidownSpider(scrapy.Spider):
      name = 'BilibiliDown'
      allowed_domains = ['api.github.com']
      start_urls = ['https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers']
      page = 1

      def parse(self, response):
          print(response.text)
          result = json.loads(response.text)
          # an empty page means we have walked past the last stargazer
          if len(result) == 0:
             return

          i = self.page*30 - 30
          for i_user in result:
              starer = GithubstarerItem()
              starer['serial_number'] = i
              starer['user_name'] = i_user['login']
              i += 1
              print(starer)
              yield starer

          # request the next page
          self.page += 1
          next_link = 'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers?page=%d'%self.page
          yield scrapy.Request(next_link, callback=self.parse)
    
  • Customize pipelines.py to save the collected items to a txt file.
    import os

    class GithubstarerPipeline(object):
      def process_item(self, item, spider):
          # the current working directory
          base_dir = os.getcwd()
          filename = base_dir + '/starers.txt'
          # open the file in append mode and write out this item
          with open(filename, 'a') as f:
              f.write(str(item['serial_number']) + '\t')
              f.write(item['user_name'] + '  \r\n')
          return item
    

    The corresponding configuration in settings.py:

    ITEM_PIPELINES = {
      'GithubStarer.pipelines.GithubstarerPipeline': 300,
    }
    
  • Other details
    The difficulty here is low, so we don't bother much with HTTP headers and the like; settings.py and middlewares.py are left almost untouched.
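
Putting the pieces above together, here is a minimal single-file sketch of the whole crawl, run in-process through Scrapy's CrawlerProcess instead of the scrapy crawl CLI. Registering the pipeline through custom_settings with a '__main__' dotted path is an assumption made purely so the demo is self-contained; in the real project the classes live in items.py and pipelines.py as shown above.

import json
import os

import scrapy
from scrapy.crawler import CrawlerProcess


class GithubstarerItem(scrapy.Item):
    serial_number = scrapy.Field()  # serial number
    user_name = scrapy.Field()      # username


class GithubstarerPipeline(object):
    def process_item(self, item, spider):
        # append each stargazer to starers.txt in the working directory
        filename = os.path.join(os.getcwd(), 'starers.txt')
        with open(filename, 'a') as f:
            f.write('%d\t%s\r\n' % (item['serial_number'], item['user_name']))
        return item


class BilibilidownSpider(scrapy.Spider):
    name = 'BilibiliDown'
    allowed_domains = ['api.github.com']
    start_urls = ['https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers']
    page = 1
    # demo-only assumption: wire up the pipeline without a full Scrapy project
    custom_settings = {
        'ITEM_PIPELINES': {'__main__.GithubstarerPipeline': 300},
    }

    def parse(self, response):
        result = json.loads(response.text)
        if len(result) == 0:  # an empty page means we are done
            return
        i = self.page * 30 - 30
        for i_user in result:
            starer = GithubstarerItem()
            starer['serial_number'] = i
            starer['user_name'] = i_user['login']
            i += 1
            yield starer
        # request the next page
        self.page += 1
        yield scrapy.Request(
            'https://api.github.com/repos/nICEnnnnnnnLee/BilibiliDown/stargazers?page=%d' % self.page,
            callback=self.parse)


if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(BilibilidownSpider)
    process.start()  # blocks until the crawl finishes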

Uploading the File to GitHub

A ready-made implementation already exists; see FileUploader4Github.
The reference script invocation looks like this:

# format today's date
cur_date=$(date "+%Y-%m-%d")
# upload path
upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
# invoke the existing implementation to upload starers.txt
java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
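
If you'd rather not depend on the jar, the upload itself is a single call to GitHub's contents API: PUT /repos/{owner}/{repo}/contents/{path} with the file's content base64-encoded in the JSON body. Below is a minimal Python sketch of that call using the requests library; the helper name upload_to_github is made up for illustration, and it only covers creating a new file (updating an existing one additionally requires the file's current sha).

import base64
import datetime

import requests


def upload_to_github(repo, path, local_file, token):
    """Create `path` in `repo` from `local_file` via the GitHub contents API."""
    with open(local_file, 'rb') as f:
        content = base64.b64encode(f.read()).decode('ascii')
    resp = requests.put(
        'https://api.github.com/repos/%s/contents/%s' % (repo, path),
        headers={'Authorization': 'token %s' % token},
        json={'message': 'daily stargazer snapshot', 'content': content},
    )
    resp.raise_for_status()  # 201 Created on success


if __name__ == '__main__':
    cur_date = datetime.date.today().strftime('%Y-%m-%d')
    upload_to_github('nICEnnnnnnnLee/GithubStargazers',
                     'BilibiliDown/%s.txt' % cur_date,
                     'starers.txt',
                     'YOUR_AUTH_TOKEN')  # placeholder, not a real token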

Periodic Invocation via a GitHub Workflow

There's nothing difficult here. The one thing to watch is that AUTH_TOKEN must be configured under Settings -> Secrets; it is the token that authorizes the file upload to the target repo.

name: CI

on:
  schedule:
    - cron: '1 0 * * *' # run at 00:01 (UTC) every day

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    # check out the project
    - uses: actions/checkout@v2
    # set up a Java environment
    - name: Set up JDK 1.8
      uses: actions/setup-java@v1
      with:
        java-version: 1.8
    # set up a Python environment
    - name: Set up Python 3.8
      uses: actions/setup-python@v1
      with:
        python-version: 3.8
    # install scrapy
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install scrapy
    # fetch the stargazers and save them to starers.txt
    - name: Get GithubStargazers
      run: |
        rm -rf starers.txt
        scrapy crawl BilibiliDown
    # upload starers.txt to the target repo
    - name: Upload GithubStargazers
      run: |
        cur_date=$(date "+%Y-%m-%d")
        upload_path="https://api.github.com/repos/nICEnnnnnnnLee/GithubStargazers/contents/BilibiliDown/$cur_date.txt"
        echo $upload_path
        java -jar tool/FileUploadTool.jar $upload_path starers.txt ${{ secrets.AUTH_TOKEN }}
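
To sanity-check that a day's snapshot actually landed, the dated file can be read back through the same contents API. A small sketch, again assuming the repo layout above (the fetch_snapshot helper is illustrative, not part of the original tooling):

import base64
import datetime

import requests


def fetch_snapshot(repo, path):
    """Fetch a file via the GitHub contents API and return its decoded text."""
    resp = requests.get('https://api.github.com/repos/%s/contents/%s' % (repo, path))
    resp.raise_for_status()
    # the API returns the file body as base64 in the 'content' field
    return base64.b64decode(resp.json()['content']).decode('utf-8')


if __name__ == '__main__':
    cur_date = datetime.date.today().strftime('%Y-%m-%d')
    text = fetch_snapshot('nICEnnnnnnnLee/GithubStargazers', 'BilibiliDown/%s.txt' % cur_date)
    print('%d stargazers recorded today' % len(text.splitlines()))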
