hai 1 ano · 48734cb46a
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,9 @@
 
				 # media_observer
			
 
				 
			
 
				+## main
			
 
				+
			
 
				+* Add proper documentation
			
 
				+
			
 
				 ## 0.2.0
			
 
				 
			
 
				 * Use version numbers
			
--- a/ISSUES.md
+++ b/ISSUES.md
@@ -0,0 +1,46 @@
 
				+## Bugs
			
 
				+
			
 
				+### Non-uniqueness of `featured_article_snapshot_id` in `snapshot_apparitions` view
			
 
				+
			
 
				+In the `featured_article_snapshot_id` view, the field `featured_article_snapshot_id` is taken as if it was unique by row, but it is not. 
			
 
				+
			
 
				+This can be easily checked with this query :
			
 
				+
			
 
				+```sql
			
 
				+SELECT * FROM (
			
 
				+    SELECT featured_article_snapshot_id, json_group_array(snapshot_id), COUNT(*) as count
			
 
				+    FROM snapshot_apparitions
			
 
				+    WHERE is_main -- Not required
			
 
				+    GROUP BY featured_article_snapshot_id
			
 
				+)
			
 
				+WHERE count > 1
			
 
				+```
			
 
				+
			
 
				+Among other things it leads to "deadends" while browsing the UI, likely because the timestamp search and time diff relies on this false assumption.
			
 
				+
			
 
				+2024-05-23 : This is likely not relevant anymore now that the URLs include the timestamp and not the snapshot_id.
			
 
				+
			
 
				+### Different virtual timestamp, same timestamp
			
 
				+
			
 
				+The snapshot process ends up choosing the same snapshot for different virtual timestamps.
			
 
				+
			
 
				+This can be checked with this query :
			
 
				+
			
 
				+```sql
			
 
				+SELECT
			
 
				+    sv.id, sv.site_id, sv2.id, sv2.site_id, sv.timestamp_virtual, sv2.timestamp_virtual, sv2.timestamp
			
 
				+FROM snapshots_view sv
			
 
				+CROSS JOIN snapshots_view sv2
			
 
				+WHERE
			
 
				+    sv.id != sv2.id
			
 
				+    and sv.timestamp = sv2.timestamp
			
 
				+```
			
 
				+
			
 
				+### Weird choices of snapshot
			
 
				+
			
 
				+Some snapshots are chosen even though they are up to 5/6 hours too early / too late.
			
 
				+
			
 
				+```sql
			
 
				+SELECT timestamp-timestamp_virtual AS difference, * FROM snapshots_view
			
 
				+ORDER BY difference
			
 
				+```
			
--- a/README.md
+++ b/README.md
@@ -1,48 +1,40 @@
 
				-# media_observer
			
 
				+# Media Observer
			
 
				 
			
 
				-## Bugs
			
 
				+A data / AI project to capture / analyse the evolution over time of the frontpages of main media sites.
			
 
				 
			
 
				-### Non-uniqueness of `featured_article_snapshot_id` in `snapshot_apparitions` view
			
 
				+The basic process consists of :
			
 
				 
			
 
				-In the `featured_article_snapshot_id` view, the field `featured_article_snapshot_id` is taken as if it was unique by row, but it is not. 
			
 
				+* finding snapshots of those sites as they were at precise times of the day (e.g. at 8h, 12h, 18h and 22h),
			
 
				+* parse those snapshots to extract relevant info (e.g. the main headline),
			
 
				+* store that info in a local database,
			
 
				+* find semantic similarities within the headlines using language models
			
 
				 
			
 
				-This can be easily checked with this query :
			
 
				+A basic web UI is available to display the results.
			
 
				 
			
 
				-```sql
			
 
				-SELECT * FROM (
			
 
				-    SELECT featured_article_snapshot_id, json_group_array(snapshot_id), COUNT(*) as count
			
 
				-    FROM snapshot_apparitions
			
 
				-    WHERE is_main -- Not required
			
 
				-    GROUP BY featured_article_snapshot_id
			
 
				-)
			
 
				-WHERE count > 1
			
 
				-```
			
 
				+At the moment, 6 sites are supported (see them [there](src/media_observer/medias/__init__.py)) but the list will expand over time.
			
 
				+
			
 
				+None of this would be possible without the incredible [Wayback Machine](http://web.archive.org/) and the volunteers that have helped setup the snapshotting of all those sites for decades.
			
 
				 
			
 
				-Among other things it leads to "deadends" while browsing the UI, likely because the timestamp search and time diff relies on this false assumption.
			
 
				+## Installation
			
 
				 
			
 
				-2024-05-23 : This is likely not relevant anymore now that the URLs include the timestamp and not the snapshot_id.
			
 
				+First you need to setup a PostgreSQL server and create a database whose path / credentials will be stored in a file `.secrets.toml` with the key `database_url`.
			
 
				 
			
 
				-### Different virtual timestamp, same timestamp
			
 
				+```
			
 
				+database_url="postgresql://user:password@yourdomain.com:port/database_name
			
 
				+```
			
 
				 
			
 
				-The snapshot process ends up choosing the same snapshot for different virtual timestamps.
			
 
				+### With Rye
			
 
				 
			
 
				-This can be checked with this query :
			
 
				+1. Install Rye project manager (following instructions from https://rye.astral.sh/guide/installation/)
			
 
				+1. Install dependencies : `rye sync --no-lock --no-dev --all-features`
			
 
				 
			
 
				-```sql
			
 
				-SELECT
			
 
				-    sv.id, sv.site_id, sv2.id, sv2.site_id, sv.timestamp_virtual, sv2.timestamp_virtual, sv2.timestamp
			
 
				-FROM snapshots_view sv
			
 
				-CROSS JOIN snapshots_view sv2
			
 
				-WHERE
			
 
				-    sv.id != sv2.id
			
 
				-    and sv.timestamp = sv2.timestamp
			
 
				-```
			
 
				+## Running the project
			
 
				 
			
 
				-### Weird choices of snapshot
			
 
				+Setup your preferences by updating [the configuration file](./settings.toml)
			
 
				 
			
 
				-Some snapshots are chosen even though they are up to 5/6 hours too early / too late.
			
 
				+### With Rye
			
 
				 
			
 
				-```sql
			
 
				-SELECT timestamp-timestamp_virtual AS difference, * FROM snapshots_view
			
 
				-ORDER BY difference
			
 
				-```
			
 
				+* Do the site snapshots : `rye run snapshots`
			
 
				+* Compute the embeddings : `rye run embeddings`
			
 
				+* Build the similarity index : `rye run similarity_index`
			
 
				+* Run the web server : `rye run web_server`
			
--- a/settings.toml
+++ b/settings.toml
@@ -1,9 +1,21 @@
 
				 [snapshots]
			
 
				+# How many days in the past snapshots will be looked for
			
 
				 days_in_past=3
			
 
				+# We will attempt to find snapshots that are close to those hours (in local time)
			
 
				 hours=[8, 12, 18, 22]
			
 
				 
			
 
				 [internet_archive]
			
 
				+# The 2 next settings allow limiting the rate at which requests will be sent to the Internet Archive.
			
 
				+# In a given interval of limiter_time_period (in seconds), at most limiter_max_rate requests will be sent.
			
 
				+
			
 
				+# The `max_rate` setting of AsyncLimiter : https://aiolimiter.readthedocs.io/en/latest/#aiolimiter.AsyncLimiter
			
 
				 limiter_max_rate=1.0
			
 
				+# The `time_period` setting of AsyncLimiter : https://aiolimiter.readthedocs.io/en/latest/#aiolimiter.AsyncLimiter
			
 
				 limiter_time_period=1.0
			
 
				+
			
 
				+# Number of seconds during which no request will be sent after a "429 Too Many Requests"
			
 
				+# HTTP error has been received
			
 
				 relaxation_time_after_error_429=60
			
 
				+# Number of seconds during which no request will be sent after encountering a TCP connection
			
 
				+# error
			
 
				 relaxation_time_after_error_connect=60