
Add some proper documentation

jherve, 1 year ago
commit 48734cb46a
4 files changed with 88 additions and 34 deletions
  1. CHANGELOG.md (+4 -0)
  2. ISSUES.md (+46 -0)
  3. README.md (+26 -34)
  4. settings.toml (+12 -0)

+ 4 - 0
CHANGELOG.md

@@ -1,5 +1,9 @@
 # media_observer
 
+## main
+
+* Add proper documentation
+
 ## 0.2.0
 
 * Use version numbers

+ 46 - 0
ISSUES.md

@@ -0,0 +1,46 @@
+## Bugs
+
+### Non-uniqueness of `featured_article_snapshot_id` in `snapshot_apparitions` view
+
+In the `snapshot_apparitions` view, the field `featured_article_snapshot_id` is treated as if it were unique per row, but it is not.
+
+This can be easily checked with this query:
+
+```sql
+SELECT * FROM (
+    SELECT featured_article_snapshot_id, json_group_array(snapshot_id), COUNT(*) as count
+    FROM snapshot_apparitions
+    WHERE is_main -- Not required
+    GROUP BY featured_article_snapshot_id
+)
+WHERE count > 1
+```
+
+Among other things, it leads to dead ends while browsing the UI, likely because the timestamp search and time diff rely on this false assumption.
+
+2024-05-23 : This is likely not relevant anymore now that the URLs include the timestamp and not the snapshot_id.
+
+### Different virtual timestamp, same timestamp
+
+The snapshot process ends up choosing the same snapshot for different virtual timestamps.
+
+This can be checked with this query :
+
+```sql
+SELECT
+    sv.id, sv.site_id, sv2.id, sv2.site_id, sv.timestamp_virtual, sv2.timestamp_virtual, sv2.timestamp
+FROM snapshots_view sv
+CROSS JOIN snapshots_view sv2
+WHERE
+    sv.id != sv2.id
+    and sv.timestamp = sv2.timestamp
+```
+
+### Weird choices of snapshot
+
+Some snapshots are chosen even though they are up to 5 or 6 hours too early or too late.
+
+```sql
+SELECT timestamp-timestamp_virtual AS difference, * FROM snapshots_view
+ORDER BY difference
+```
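
The duplicate-timestamp check from ISSUES.md can be tried on throwaway data before running it against the real database. The sketch below is only an illustration: it uses an in-memory SQLite table as a hypothetical stand-in for `snapshots_view`, stores timestamps as plain integers, and narrows the original query in two ways that are assumptions, not part of the commit (pairs are restricted to the same `site_id`, and `sv.id < sv2.id` lists each pair once instead of twice).

```python
import sqlite3

# Hypothetical stand-in for the project's `snapshots_view`; only the columns
# used by the queries in ISSUES.md are modelled, with integer timestamps.
SCHEMA = """
CREATE TABLE snapshots_view (
    id INTEGER PRIMARY KEY,
    site_id INTEGER,
    timestamp INTEGER,
    timestamp_virtual INTEGER
)
"""

# Assumed variant of the self-join: same site, same real timestamp,
# and each pair of rows reported only once.
REUSED_SNAPSHOTS = """
SELECT sv.id, sv2.id, sv.timestamp_virtual, sv2.timestamp_virtual, sv.timestamp
FROM snapshots_view sv
JOIN snapshots_view sv2
    ON sv.site_id = sv2.site_id
    AND sv.timestamp = sv2.timestamp
    AND sv.id < sv2.id
ORDER BY sv.id, sv2.id
"""


def find_reused_snapshots(rows):
    """Return pairs of rows that share one real snapshot timestamp."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(SCHEMA)
        conn.executemany("INSERT INTO snapshots_view VALUES (?, ?, ?, ?)", rows)
        return conn.execute(REUSED_SNAPSHOTS).fetchall()
    finally:
        conn.close()


rows = [
    (1, 1, 1000, 900),
    (2, 1, 1000, 5000),  # same real timestamp as id 1, different virtual one
    (3, 1, 9000, 9100),
    (4, 2, 1000, 900),   # same timestamp but another site: not reported
]
reused = find_reused_snapshots(rows)
```

Here `reused` holds a single pair, ids 1 and 2, which is the situation the bug report describes.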

+ 26 - 34
README.md

@@ -1,48 +1,40 @@
-# media_observer
+# Media Observer
 
-## Bugs
+A data / AI project to capture and analyse how the front pages of major media sites evolve over time.
 
-### Non-uniqueness of `featured_article_snapshot_id` in `snapshot_apparitions` view
+The basic process consists of:
 
-In the `featured_article_snapshot_id` view, the field `featured_article_snapshot_id` is taken as if it was unique by row, but it is not. 
+* finding snapshots of those sites as they were at precise times of the day (e.g. at 8h, 12h, 18h and 22h),
+* parsing those snapshots to extract relevant info (e.g. the main headline),
+* storing that info in a database,
+* finding semantic similarities between the headlines using language models.
 
-This can be easily checked with this query :
+A basic web UI is available to display the results.
 
-```sql
-SELECT * FROM (
-    SELECT featured_article_snapshot_id, json_group_array(snapshot_id), COUNT(*) as count
-    FROM snapshot_apparitions
-    WHERE is_main -- Not required
-    GROUP BY featured_article_snapshot_id
-)
-WHERE count > 1
-```
+At the moment, 6 sites are supported (see the list [here](src/media_observer/medias/__init__.py)), but the list will expand over time.
+
+None of this would be possible without the incredible [Wayback Machine](http://web.archive.org/) and the volunteers who have helped set up the snapshotting of all those sites for decades.
 
-Among other things it leads to "deadends" while browsing the UI, likely because the timestamp search and time diff relies on this false assumption.
+## Installation
 
-2024-05-23 : This is likely not relevant anymore now that the URLs include the timestamp and not the snapshot_id.
+First, you need to set up a PostgreSQL server and create a database. Its address and credentials are stored in a file named `.secrets.toml` under the key `database_url`.
 
-### Different virtual timestamp, same timestamp
+```
+database_url="postgresql://user:password@yourdomain.com:port/database_name"
+```
 
-The snapshot process ends up choosing the same snapshot for different virtual timestamps.
+### With Rye
 
-This can be checked with this query :
+1. Install the Rye project manager (following the instructions at https://rye.astral.sh/guide/installation/)
+1. Install the dependencies: `rye sync --no-lock --no-dev --all-features`
 
-```sql
-SELECT
-    sv.id, sv.site_id, sv2.id, sv2.site_id, sv.timestamp_virtual, sv2.timestamp_virtual, sv2.timestamp
-FROM snapshots_view sv
-CROSS JOIN snapshots_view sv2
-WHERE
-    sv.id != sv2.id
-    and sv.timestamp = sv2.timestamp
-```
+## Running the project
 
-### Weird choices of snapshot
+Set up your preferences by editing [the configuration file](./settings.toml).
 
-Some snapshots are chosen even though they are up to 5/6 hours too early / too late.
+### With Rye
 
-```sql
-SELECT timestamp-timestamp_virtual AS difference, * FROM snapshots_view
-ORDER BY difference
-```
+* Take the site snapshots: `rye run snapshots`
+* Compute the embeddings: `rye run embeddings`
+* Build the similarity index: `rye run similarity_index`
+* Run the web server: `rye run web_server`

+ 12 - 0
settings.toml

@@ -1,9 +1,21 @@
 [snapshots]
+# How many days in the past snapshots will be looked for
 days_in_past=3
+# We will attempt to find snapshots that are close to those hours (in local time)
 hours=[8, 12, 18, 22]
 
 [internet_archive]
+# The next 2 settings limit the rate at which requests are sent to the Internet Archive.
+# In any interval of limiter_time_period seconds, at most limiter_max_rate requests will be sent.
+
+# The `max_rate` setting of AsyncLimiter : https://aiolimiter.readthedocs.io/en/latest/#aiolimiter.AsyncLimiter
 limiter_max_rate=1.0
+# The `time_period` setting of AsyncLimiter : https://aiolimiter.readthedocs.io/en/latest/#aiolimiter.AsyncLimiter
 limiter_time_period=1.0
+
+# Number of seconds during which no request will be sent after a "429 Too Many Requests"
+# HTTP error has been received
 relaxation_time_after_error_429=60
+# Number of seconds during which no request will be sent after encountering a TCP connection
+# error
 relaxation_time_after_error_connect=60
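
To make the `[snapshots]` settings concrete, here is a minimal sketch of how `days_in_past` and `hours` could translate into the list of local times for which a snapshot is searched. The function `target_datetimes` is hypothetical, and including the current day in addition to the `days_in_past` previous days is an assumption. The project's actual logic may differ.

```python
from datetime import date, datetime, time, timedelta


def target_datetimes(today, days_in_past, hours):
    """List the local datetimes to look for snapshots around, oldest first.

    Assumption: the range covers `today` plus the `days_in_past` days before it.
    """
    targets = []
    for offset in range(days_in_past, -1, -1):
        day = today - timedelta(days=offset)
        for hour in sorted(hours):
            targets.append(datetime.combine(day, time(hour=hour)))
    return targets


# Values taken from settings.toml; the date is arbitrary.
targets = target_datetimes(date(2024, 5, 23), days_in_past=3, hours=[8, 12, 18, 22])
```

With the default settings this gives 16 targets, from 2024-05-20 08:00 to 2024-05-23 22:00. At `limiter_max_rate=1.0` request per `limiter_time_period=1.0` second, that is a small number of lookups per site.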