First of all, we need to find all the CKAN extensions on GitHub. We're going to look at all the repos that have ckanext
in their name. The GitHub Search API has quite strict rate limits, so we have to use a few techniques to avoid rate-limit errors:
$ python code/extract.py
import os
import time

from dotenv import load_dotenv
from frictionless import Resource
from github import Github, RateLimitExceededException

load_dotenv()

PAUSE = 1
RETRY = 10
QUERY = "ckanext in:name stars:>0"

github = Github(os.environ["GITHUB_TOKEN"], per_page=100)


# Source

def search_items():
    items = []
    results = github.search_repositories(QUERY, sort="stars", order="desc")
    time.sleep(PAUSE)
    page_number = 0
    while True:
        try:
            page = results.get_page(page_number)
        except RateLimitExceededException:
            time.sleep(RETRY)
            continue
        time.sleep(PAUSE)
        page_number += 1
        if not page:
            break
        for result in page:
            item = {}
            item["code"] = "-".join([result.owner.login, result.name])
            item["user"] = result.owner.login
            item["repo"] = result.name
            item["branch"] = result.default_branch
            item["stars"] = result.stargazers_count
            item["description"] = result.description
            items.append(item)
        print(f"Found items: {len(items)}")
    return items


# General

resource = Resource(search_items())
resource.write("data/extensions.raw.csv")
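The script above handles rate limiting with a fixed pause (RETRY seconds) before retrying a failed page. An alternative is exponential backoff, where each retry waits twice as long as the previous one. A minimal sketch of such a helper (with_backoff is a hypothetical name, not part of the script or any library used here):

```python
import time


def with_backoff(func, retries=5, base_delay=1.0):
    """Call func(), retrying on exception with exponentially growing pauses."""
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

In search_items, the try/except around results.get_page could then be replaced by a call like with_backoff(lambda: results.get_page(page_number)).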
As a high-level data collection framework, we use Frictionless Transform. It sorts the extensions by repository stargazer count and saves the result to a CSV file:
$ python code/transform.py
from frictionless import Resource, transform, steps


# General

transform(
    Resource("data/extensions.raw.csv"),
    steps=[
        steps.table_normalize(),
        steps.row_sort(field_names=["stars"], reverse=True),
        steps.table_write(path="data/extensions.csv"),
    ],
)
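The row_sort step behaves like a descending sort on the stars field. A plain-Python sketch of the same operation, using made-up sample rows for illustration:

```python
# Hypothetical sample rows shaped like the extracted records
rows = [
    {"repo": "ckanext-a", "stars": 5},
    {"repo": "ckanext-b", "stars": 42},
    {"repo": "ckanext-c", "stars": 17},
]

# Equivalent of steps.row_sort(field_names=["stars"], reverse=True)
sorted_rows = sorted(rows, key=lambda row: row["stars"], reverse=True)
```

The resulting data/extensions.csv therefore lists the most-starred extensions first.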