Regular Download Links
Regular downloads occur when the file link is directly available within the HTML (typically in the `href` of an `<a>` tag). Clicking these links directly initiates a file download.
To handle these downloads:
- Save the URL directly from the page.
- Reworkd will then asynchronously visit the URL and download the file. We use `curl-cffi` to mimic browser behavior when downloading the file.
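As a rough illustration of the "save the URL" step, the sketch below pulls file links out of a page's HTML and resolves them to absolute URLs. It uses only the Python standard library. The `FILE_EXTENSIONS` tuple, the `DownloadLinkParser` class, and the `extract_download_urls` helper are hypothetical names chosen for this example, not part of Reworkd's API.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin

# Hypothetical set of extensions treated as downloadable files.
FILE_EXTENSIONS = (".pdf", ".csv", ".zip", ".xlsx")


class DownloadLinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags that look like files."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore the query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(FILE_EXTENSIONS):
            # Resolve relative hrefs against the page URL before saving.
            self.links.append(urljoin(self.base_url, href))


def extract_download_urls(html: str, base_url: str) -> List[str]:
    """Return every file-like link found in `html`, as absolute URLs."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Only the URL is collected here. Fetching the file is the separate, asynchronous step described above.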
Indirect Download Links
Indirect downloads happen when the direct link isn’t immediately visible but becomes available after clicking a button or link. To handle indirect downloads:
- Click the button/link to open the URL.
- Capture and save the newly loaded URL.
- Automatically navigate back.
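The three steps above can be sketched as a small helper. This is an assumption-laden illustration: `PageLike` and `capture_indirect_url` are hypothetical names, and the page object is assumed to expose `click`, `wait_for_load_state`, `url`, and `go_back`, as Playwright's synchronous `Page` does. It is not Reworkd's own implementation.

```python
from typing import Protocol


class PageLike(Protocol):
    """Subset of a browser page; Playwright's sync Page offers these members."""

    url: str

    def click(self, selector: str) -> None: ...
    def wait_for_load_state(self) -> None: ...
    def go_back(self) -> object: ...


def capture_indirect_url(page: PageLike, selector: str) -> str:
    """Click through to the file URL, record it, then return to the original page."""
    page.click(selector)        # 1. open the URL behind the button/link
    page.wait_for_load_state()  # let the navigation settle before reading the URL
    file_url = page.url         # 2. capture the newly loaded URL
    page.go_back()              # 3. navigate back
    return file_url
```

Typing the page as a protocol keeps the helper independent of any one browser library, so it can be exercised with a stub page in tests.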
JavaScript/Dynamic Downloads
Dynamic downloads occur when a file download is triggered by JavaScript events directly in the browser, without a direct URL. To handle dynamic downloads:
- Use the `capture_download` method to trigger and capture the download directly in the browser.
- Retrieve the file metadata (URL and title).
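The signature of `capture_download` is not shown here, so the sketch below does not call it. Instead it illustrates the same idea with Playwright's synchronous `page.expect_download()` context manager, which waits for the browser's download event. The helper name `capture_dynamic_download` is hypothetical, and mapping the "title" to Playwright's `suggested_filename` is an assumption made for this example.

```python
from typing import Any, Dict


def capture_dynamic_download(page: Any, selector: str) -> Dict[str, str]:
    """Trigger a JavaScript-driven download and return its metadata.

    `page` is assumed to be a Playwright sync Page (or a compatible object).
    """
    # expect_download() waits for the download event fired by the click.
    with page.expect_download() as download_info:
        page.click(selector)
    download = download_info.value
    # Assumption: the suggested filename stands in for the file's title.
    return {"url": download.url, "title": download.suggested_filename}
```

Because the download happens inside the live browser context, this pattern also keeps whatever cookies and session state the page already holds.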
Downloads Requiring Cookies/Session
Some sites require the download to occur within the same browser session that accessed the page, making AWS Lambda unsuitable. In these cases:
- Follow the same approach as dynamic downloads, handling the download directly in the browser context using `capture_download`.