Written in Livemark
(2021-09-28 09:37)

Data Collection

Extract

First of all, we need to find all the CKAN extensions on GitHub. We look for every repository that has ckanext in its name and at least one star. The GitHub Search API has quite strict rate limits, so the script pauses between requests and, when a rate limit error is raised, waits and retries the same page:

$ python code/extract.py
import os
import time
from dotenv import load_dotenv
from frictionless import Resource
from github import Github, RateLimitExceededException


load_dotenv()
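PAUSE = 1  # seconds to wait between requests
RETRY = 10  # seconds to wait after a rate limit error
QUERY = "ckanext in:name stars:>0"  # "ckanext" in the repo name, more than 0 stars
github = Github(os.environ["GITHUB_TOKEN"], per_page=100)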


# Source


def search_items():
    items = []
    results = github.search_repositories(QUERY, sort="stars", order="desc")
    time.sleep(PAUSE)
    page_number = 0
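    while True:
        try:
            page = results.get_page(page_number)
        except RateLimitExceededException:
            # Rate limit hit: wait, then request the same page again
            time.sleep(RETRY)
            continue
        time.sleep(PAUSE)
        page_number += 1
        # An empty page means there are no more results
        if not page:
            break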
        for result in page:
            item = {}
            item["code"] = "-".join([result.owner.login, result.name])
            item["user"] = result.owner.login
            item["repo"] = result.name
            item["branch"] = result.default_branch
            item["stars"] = result.stargazers_count
            item["description"] = result.description
            items.append(item)
        print(f"Found items: {len(items)}")
    return items


# General


resource = Resource(search_items())
resource.write("data/extensions.raw.csv")
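
The pause-and-retry technique used in search_items can also be isolated into a small helper. The following is only a minimal sketch and is not part of the project's code: RateLimitError, call_with_retry and flaky_fetch are hypothetical names, with RateLimitError standing in for RateLimitExceededException so the example runs without network access. Unlike the loop above, it gives up after a fixed number of attempts and doubles the wait each time, which is one possible design choice rather than the author's method.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for github.RateLimitExceededException."""


def call_with_retry(fetch, retries=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call fetch(); on RateLimitError wait and retry with a growing delay."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except RateLimitError:
            if attempt == retries:
                raise  # give up instead of retrying forever
            sleep(delay)
            delay *= backoff


# Usage: a fake fetcher that fails twice before returning a page
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError()
    return ["repo-a", "repo-b"]


waits = []  # record the requested waits instead of actually sleeping
page = call_with_retry(flaky_fetch, delay=1.0, sleep=waits.append)
print(page, waits)
```

Passing sleep as a parameter keeps the helper easy to test; in a real script the default time.sleep would be used.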

Transform

For this step we use Frictionless Transform as a high-level data framework. The pipeline normalizes the table, sorts the extensions by their repositories' stargazer counts in descending order, and saves the result to a CSV file:

$ python code/transform.py
from frictionless import Resource, transform, steps


# General


transform(
    Resource("data/extensions.raw.csv"),
    steps=[
        steps.table_normalize(),
        steps.row_sort(field_names=["stars"], reverse=True),
        steps.table_write(path="data/extensions.csv"),
    ],
)
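
To make the effect of the sort step concrete, here is a rough standard-library sketch of the same idea: read CSV rows, sort them by stars in descending order, and write them back. It is not how Frictionless implements the pipeline; the sample rows are invented for illustration, and the explicit int() cast is an assumption about what a plain csv reader needs, since it returns every value as a string.

```python
import csv
import io

# Invented sample data standing in for data/extensions.raw.csv
raw = io.StringIO(
    "code,stars\n"
    "a-ckanext-one,9\n"
    "b-ckanext-two,120\n"
    "c-ckanext-three,35\n"
)

rows = list(csv.DictReader(raw))
# Cast before sorting: compared as strings, "9" would sort above "120"
rows.sort(key=lambda row: int(row["stars"]), reverse=True)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["code", "stars"], lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```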
A livemark listing CKAN extensions hosted on Github