jherve 48734cb46a Add some proper documentation		1 year ago
src	52b7d4b4de Change the project name	1 year ago
static	3501a02749 Add logo information and display it in the UI	1 year ago
templates	70d9705643 [fix] URLs now use the virtual_timestamp instead of an ID	1 year ago
tests	1c91c4bfe6 Initial commit	1 year ago
.gitignore	d51ff03bb2 Switch to Annoy for vector search/indexing	1 year ago
CHANGELOG.md	48734cb46a Add some proper documentation	1 year ago
ISSUES.md	48734cb46a Add some proper documentation	1 year ago
README.md	48734cb46a Add some proper documentation	1 year ago
config.py	caa854cba8 Format some file	1 year ago
install-git-hooks.sh	d120dfd089 Add a script to install commit git hooks	1 year ago
pyproject.toml	91fa1e33ef Bump to version 0.2.0	1 year ago
requirements-dev.lock	cf34257719 Extend duration display to similar articles	1 year ago
requirements-embeddings.lock	5d907e27a8 Add a requirements file for embeddings	1 year ago
requirements.lock	cf34257719 Extend duration display to similar articles	1 year ago
settings.toml	48734cb46a Add some proper documentation	1 year ago

Media Observer

A data/AI project that captures and analyses how the front pages of major media sites evolve over time.

The basic process consists of:

finding snapshots of those sites as they were at precise times of the day (e.g. at 8h, 12h, 18h and 22h),
parsing those snapshots to extract relevant info (e.g. the main headline),
storing that info in a local database,
finding semantic similarities between the headlines using language models.
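
To make the first step concrete, here is a minimal sketch of how snapshot lookups for fixed times of day could be prepared. It assumes the public Wayback Machine availability endpoint (https://archive.org/wayback/available); the README does not say which Wayback Machine interface the project actually uses, and the function names below are hypothetical. The sketch only builds request URLs and parses a response body, so it makes no network call by itself.

```python
import json
from datetime import datetime
from typing import List, Optional
from urllib.parse import urlencode

# Assumption: the public Wayback Machine availability endpoint.
# The project itself may rely on a different interface.
WAYBACK_AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(site: str, when: datetime) -> str:
    """Build the lookup URL for the snapshot of `site` closest to `when`."""
    query = urlencode({"url": site, "timestamp": when.strftime("%Y%m%d%H%M%S")})
    return f"{WAYBACK_AVAILABILITY_API}?{query}"


def daily_lookups(site: str, day: datetime, hours=(8, 12, 18, 22)) -> List[str]:
    """One lookup URL per target hour of the given day."""
    return [
        availability_url(site, day.replace(hour=h, minute=0, second=0, microsecond=0))
        for h in hours
    ]


def closest_snapshot(payload: str) -> Optional[str]:
    """Extract the archived URL from an availability response, if there is one."""
    closest = json.loads(payload).get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None
```

The response body would be fetched with any HTTP client and passed to closest_snapshot; an empty "archived_snapshots" object means no snapshot was found for that time.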

A basic web UI is available to display the results.

At the moment, 6 sites are supported (see them there), but the list will expand over time.

None of this would be possible without the incredible Wayback Machine and the volunteers who have helped set up the snapshotting of all those sites for decades.

Installation

First, set up a PostgreSQL server and create a database. Store its connection URL (including credentials) in a file named .secrets.toml under the key database_url:

database_url="postgresql://user:password@yourdomain.com:port/database_name"

With Rye

Install the Rye project manager (following the instructions at https://rye.astral.sh/guide/installation/)
Install the dependencies: rye sync --no-lock --no-dev --all-features

Running the project

Set up your preferences by updating the configuration file.

With Rye

Take the site snapshots: rye run snapshots
Compute the embeddings: rye run embeddings
Build the similarity index: rye run similarity_index
Run the web server: rye run web_server
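
To illustrate what the embeddings and similarity steps compute conceptually, here is a small sketch that ranks headlines by cosine similarity between their embedding vectors. It is a brute-force stand-in written for illustration only: the project's commit history mentions Annoy for vector indexing, and nothing here reproduces the project's actual code. The vectors and headline identifiers are made up.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(
    query_id: str, embeddings: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k other items, most similar first."""
    query = embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in embeddings.items()
        if other_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings"; real language-model vectors are much larger.
toy_embeddings = {
    "headline_a": [1.0, 0.0, 0.0],
    "headline_b": [0.9, 0.1, 0.0],
    "headline_c": [0.0, 1.0, 0.0],
}
```

Calling most_similar("headline_a", toy_embeddings) ranks headline_b ahead of headline_c. An approximate index trades a little accuracy for speed once the number of headlines makes this exhaustive comparison too slow.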
