f17ee64676
Use ScrapingBee for some hosts
2022-07-16 14:09:45 -07:00
2660262c69
Use puppeteer-core
2022-07-15 11:43:55 -07:00
2447bd658e
Use chrome-aws-lambda in GCF
2022-07-15 10:58:58 -07:00
d404cd7c4c
fix comment
2022-07-15 21:41:06 +08:00
1f1698ea81
sync changes to content-fetch-gcf
2022-07-15 15:11:41 +08:00
0cc7e84a82
Fix linkedom not parsing content properly without an <html> tag by replacing innerHtml with outerHtml
2022-05-18 15:52:16 +08:00
8f0447ed3f
Stop blocking images and CSS files
2022-05-18 15:50:52 +08:00
0e31a40331
Use chrome-aws-lambda in the puppeteer GCF
2022-05-13 16:48:51 -07:00
f5003c1370
Stop blocking script
2022-05-13 12:17:19 +08:00
37e55add98
Stop blocking stylesheet and media
2022-05-13 12:09:05 +08:00
60bbbb6cf3
Block requests to 'font', 'image', 'stylesheet', 'script', 'media' in puppeteer
2022-05-12 17:10:38 +08:00
9606cd6b28
Remove chrome-aws-lambda dependencies
2022-05-12 16:32:22 +08:00
0984dca183
Remove adblocker, block resources by URL, and also block the MathJax script
2022-05-11 22:04:47 +08:00
0b11c31317
Add linkedom dependency in packages/api
2022-05-10 18:31:25 +08:00
4c7f6d0281
Update comments
2022-05-09 13:45:45 +08:00
4571f1f51c
Add metrics
2022-05-09 13:45:45 +08:00
21799b7b6d
Add puppeteer-stealth and puppeteer-ad-block plugins and a user-data-dir to reduce processing time
2022-05-09 13:45:45 +08:00
6f29f18743
Parse image and save it in an <img> element
2022-05-05 12:13:08 +08:00
b679451548
Fix parsing articles from www.derstandard.at (#459)
* Fix parsing articles from www.derstandard.at
* Slim cookies down
2022-04-22 10:53:28 +08:00
46b526961a
Dockerize the puppeteer-parse service and add to docker-compose
2022-02-12 13:14:00 -08:00
42836b6b38
Simplify startup of the puppeteer service
- Run on port 9090 so we don't conflict with other services
- Route the docker-compose requests to the host network
- Don't require preview bucket information on startup
2022-02-11 14:44:32 -08:00
84f32935f5
Open source Omnivore
2022-02-11 09:24:33 -08:00