jherve 5ca6864b59 Use an enum for column types 1 year ago
src 5ca6864b59 Use an enum for column types 1 year ago
static d8e8323043 Fix grid for articles published at same time 1 year ago
templates c05f653a97 Add link to the frontpage snapshot 1 year ago
tests 1c91c4bfe6 Initial commit 1 year ago
.gitignore d51ff03bb2 Switch to Annoy for vector search/indexing 1 year ago
CHANGELOG.md ac4ab6b937 Improve layout on mobile 1 year ago
ISSUES.md 48734cb46a Add some proper documentation 1 year ago
README.md 085a82b5b1 Update README 1 year ago
config.py caa854cba8 Format some file 1 year ago
install-git-hooks.sh d120dfd089 Add a script to install commit git hooks 1 year ago
pyproject.toml 6d305a47bb Simplify database schema 1 year ago
requirements-dev.lock 6d305a47bb Simplify database schema 1 year ago
requirements-embeddings.lock 5d907e27a8 Add a requirements file for embeddings 1 year ago
requirements.lock 6d305a47bb Simplify database schema 1 year ago
settings.toml 48734cb46a Add some proper documentation 1 year ago

README.md

Media Observer

A data / AI project that captures and analyses how the front pages of major media sites evolve over time.

A live version is available here: http://18.171.236.162:8000/

(Please forgive the ugly UI 🥹! It works best on desktop, though there is an almost readable mobile version.)

* Hosted on an AWS free-tier EC2 instance + "12 months free" RDS database, managed with Terraform

What is this?

This project aims to observe which subjects news media put forward on their websites.

The basic process consists of:

  • finding snapshots of those sites as they were at precise times of the day (e.g. at 8h, 12h, 18h and 22h),
  • parsing those snapshots to extract relevant info (e.g. the main headline),
  • storing that info in a local database,
  • finding semantic similarities within the headlines using language models.
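
To give an idea of what the last step means in practice, here is a minimal sketch of ranking headlines by cosine similarity between their embedding vectors. It is an illustration only, not the project's actual code: the function names and the toy 3-dimensional vectors are made up, and real embeddings produced by a language model have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction, so treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, embeddings, top_k=3):
    """Rank every other headline by similarity to the one identified by query_id."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings, only here to make the example runnable.
toy_embeddings = {
    "headline_a": [1.0, 0.0, 0.0],
    "headline_b": [0.9, 0.1, 0.0],
    "headline_c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    print(most_similar("headline_a", toy_embeddings, top_k=2))
```

A brute-force comparison like this one is fine for a handful of headlines. With many headlines, an approximate nearest-neighbour index is usually the better choice.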

A basic web UI is available to display the results.

At the moment, 6 sites are supported (see them there) but the list will expand over time.

None of this would be possible without the incredible Wayback Machine and the volunteers who have helped set up the snapshotting of all those sites for decades.
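
For the curious, here is a rough sketch of how a snapshot close to a given time can be looked up through the Wayback Machine's public availability endpoint. This is an assumption about one possible approach, not necessarily the one used in this project (which may rely on a different Wayback API); the function names are hypothetical.

```python
import json
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import urlopen

# Public "availability" endpoint of the Wayback Machine.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(site_url, when):
    """Build the query URL asking for the snapshot closest to `when`."""
    params = {"url": site_url, "timestamp": when.strftime("%Y%m%d%H%M%S")}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot_url(payload):
    """Extract the snapshot URL from a decoded JSON response, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


def find_snapshot(site_url, when, timeout=10.0):
    """Query the endpoint over the network and return the closest snapshot URL."""
    with urlopen(build_availability_url(site_url, when), timeout=timeout) as response:
        return closest_snapshot_url(json.load(response))


if __name__ == "__main__":
    # Example: the snapshot of example.com closest to 8h on 1 May 2024.
    print(find_snapshot("example.com", datetime(2024, 5, 1, 8, 0, 0)))
```

The endpoint returns the snapshot closest to the requested timestamp, which may be hours away from it, so the snapshot's own timestamp is worth checking before using it.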

Installation

First, you need to set up a PostgreSQL server and create a database. Its URL, including the credentials, must be stored in a file named .secrets.toml under the key database_url.

database_url="postgresql://user:password@yourdomain.com:port/database_name"

With Rye

  1. Install the Rye project manager (following the instructions at https://rye.astral.sh/guide/installation/)
  2. Install the dependencies: rye sync --no-lock --no-dev --all-features

Running the project

Set up your preferences by updating the configuration file.

With Rye

  • Take the site snapshots: rye run snapshots
  • Compute the embeddings: rye run embeddings
  • Build the similarity index: rye run similarity_index
  • Run the web server: rye run web_server
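
If you would rather launch everything in one go, the commands above can be chained from a small Python script. This is a sketch, not something shipped with the project: it assumes the steps are meant to run in the order listed, and the helper names are made up. Note that the last step starts the web server and therefore keeps running until interrupted.

```python
import subprocess

# Rye script names, in the order they are listed in this README.
PIPELINE = ("snapshots", "embeddings", "similarity_index", "web_server")


def build_commands(steps=PIPELINE):
    """Turn script names into argument lists suitable for subprocess.run."""
    return [["rye", "run", step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in turn; check=True stops at the first failing step."""
    for command in commands:
        runner(command, check=True)


if __name__ == "__main__":
    run_pipeline(build_commands())
```

The runner parameter only exists so that the function can be exercised without actually invoking Rye.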