# Media Observer A data / AI project to capture / analyse the evolution over time of the frontpages of main media sites. The basic process consists of : * finding snapshots of those sites as they were at precise times of the day (e.g. at 8h, 12h, 18h and 22h), * parse those snapshots to extract relevant info (e.g. the main headline), * store that info in a local database, * find semantic similarities within the headlines using language models A basic web UI is available to display the results. At the moment, 6 sites are supported (see them [there](src/media_observer/medias/__init__.py)) but the list will expand over time. None of this would be possible without the incredible [Wayback Machine](http://web.archive.org/) and the volunteers that have helped setup the snapshotting of all those sites for decades. ## Installation First you need to setup a PostgreSQL server and create a database whose path / credentials will be stored in a file `.secrets.toml` with the key `database_url`. ``` database_url="postgresql://user:password@yourdomain.com:port/database_name ``` ### With Rye 1. Install Rye project manager (following instructions from https://rye.astral.sh/guide/installation/) 1. Install dependencies : `rye sync --no-lock --no-dev --all-features` ## Running the project Setup your preferences by updating [the configuration file](./settings.toml) ### With Rye * Do the site snapshots : `rye run snapshots` * Compute the embeddings : `rye run embeddings` * Build the similarity index : `rye run similarity_index` * Run the web server : `rye run web_server`