Commit Graph

178 Commits

SHA1 Message Date
706ab6075d double capacity of content-fetch worker 2024-08-23 12:23:22 +08:00
eec2734f5a keep up to 1 hour complete tasks and up to 1 day failed tasks 2024-08-22 15:27:53 +08:00
f0865654dc fix: allow content-fetch to still process http requests 2024-08-22 15:14:41 +08:00
aa2caca944 do not cache content from https://jacksonh.org 2024-08-21 18:47:23 +08:00
0366c426bc process up to 2 jobs concurrently 2024-08-21 17:58:58 +08:00
6de285432d fix importer status not updated if failed to fetch content 2024-08-21 16:53:50 +08:00
72d89308c5 process up to 10 jobs concurrently 2024-08-21 12:24:35 +08:00
0fbc6d0a87 update importer to use content-fetch queue 2024-08-21 12:24:35 +08:00
5bd272dde0 use one queue with different priority for fetching content of rss feed item or saved url 2024-08-21 12:24:35 +08:00
d29ac109cb slow and rss queue process 5 tasks/s and normal queue process 100 tasks per second 2024-08-21 12:24:35 +08:00
34edbeba56 fix dockerfile 2024-08-21 12:24:35 +08:00
08fbb8aebf use different queues for fast,slow and rss content fetch jobs 2024-08-21 12:24:35 +08:00
87b4ec503e enqueue content-fetch task to the queue 2024-08-21 12:24:35 +08:00
e3eae1c96c create a worker to process content-fetch job 2024-08-21 12:24:35 +08:00
4674321531 reduce blocking domain to 1 hour 2024-08-18 12:37:10 +08:00
322f736fe0 stop storing original html in the database 2024-07-31 19:14:38 +08:00
0e0c4bddac block failed domains 2024-07-24 16:55:50 +08:00
31fe4b65a0 remove readability from content-fetch 2024-07-24 12:53:41 +08:00
29a5b20d2c remove scrapingbee from content-fetch 2024-07-24 12:17:13 +08:00
75338f5927 bypass cloudflare captcha 2024-07-10 14:43:47 +08:00
73e180f43d add more dependencies to docker container 2024-07-09 19:16:21 +08:00
c75cbb39d6 injecting webgl fingerprint 2024-07-09 14:11:31 +08:00
dd01202374 do not cache some urls 2024-07-05 19:46:18 +08:00
728059c6f8 do not cache some urls 2024-07-05 19:05:36 +08:00
b38b28c75e create a browser singleton instance and checks browser existence before creating context 2024-07-04 19:12:42 +08:00
bbc7b5e600 use @omnivore/utils in import-handler 2024-07-03 22:20:27 +08:00
59c826fd5e use @omnivore/utils in content-fetch 2024-07-03 21:58:22 +08:00
f2ff4b7b0a fix: only send content_fetch_failure event to analytics 2024-05-31 12:44:01 +08:00
fc9d5c64ec do not fail if cache missed 2024-05-17 17:27:34 +08:00
6f2aa2e0cd add more logs 2024-05-17 17:19:55 +08:00
52ebf466e3 get content from cache first when saving url 2024-05-17 16:46:54 +08:00
9c3d619ad5 put locale and timezone in cache key 2024-05-17 16:22:20 +08:00
dde9f16396 put error message in the analytic event 2024-05-17 16:16:44 +08:00
f3ce6f4d4e catch content fetch result in redis 2024-05-17 15:55:28 +08:00
efb9b6b139 add source to the content_fetch event 2024-05-17 14:54:46 +08:00
9dee510be1 fix rss 2024-05-14 20:18:18 +08:00
cce5f2463d still use redis for cache 2024-05-14 17:16:26 +08:00
04ba62977e fix rebase conflicts 2024-05-14 17:14:41 +08:00
e093c9e096 fix comment 2024-05-14 17:14:41 +08:00
3e925e0193 update comment 2024-05-14 17:14:41 +08:00
5bd157ca25 hash url as the key 2024-05-14 17:14:41 +08:00
7a0b2f3d33 upload file only not exists 2024-05-14 17:14:41 +08:00
9286174ec7 upload and download original content from GCS 2024-05-14 17:14:40 +08:00
33e1c4dd00 remove flush method from analytics class 2024-05-13 19:10:14 +08:00
7634ed667f capture total time of fetching a page 2024-05-13 17:01:52 +08:00
f64bd4700f update analytic event details 2024-05-13 15:18:04 +08:00
a924c8448b capture content-fetch success and error events 2024-05-13 14:55:48 +08:00
0c0a95a79c fix newsletter dir not saved correctly 2024-04-24 21:10:13 +08:00
824b256d20 fix memory leak from axios error 2024-04-24 15:55:54 +08:00
7f441b4ff3 dedupe save-page job 2024-04-23 21:44:25 +08:00