Handling File Downloads
Different types of file downloads call for different strategies. This page outlines the main ones and when to use each.
Regular Download Links
Regular downloads occur when the file link is directly available within the HTML (typically in the href of an <a> tag). Clicking these links directly initiates a file download.
To handle these downloads:
- Save the URL directly from the page.
- Reworkd will then asynchronously visit the URL and download the file. We use curl-cffi to mimic browser behavior when downloading the file.
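The "save the URL" step can be sketched with the Python standard library alone. This is an illustration rather than Reworkd's own API: the FILE_EXTENSIONS tuple, the DownloadLinkParser class, and the find_download_links helper are hypothetical names, and the extension filter is an assumption about which links count as files. The download itself is still left to Reworkd.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical set of extensions treated as downloadable files.
FILE_EXTENSIONS = (".pdf", ".csv", ".zip", ".xlsx")


class DownloadLinkParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags that look like files."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore the query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(FILE_EXTENSIONS):
            # Resolve relative hrefs against the page URL before saving.
            self.links.append(urljoin(self.base_url, href))


def find_download_links(html: str, base_url: str) -> list[str]:
    """Return every file-like link found in the given HTML."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Resolving each href against the page URL matters because relative links would otherwise be unusable once they are fetched outside the page.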
Indirect Download Links
Indirect downloads happen when the direct link isn’t immediately visible but becomes available after clicking a button or link.
To handle indirect downloads:
- Click the button/link to open the URL.
- Capture and save the newly loaded URL.
- Automatically navigate back.
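The three steps above can be sketched as follows, assuming a Playwright-style async page object that exposes click, wait_for_load_state, url, and go_back. The PageLike protocol and the capture_indirect_url helper are hypothetical names used for illustration only, and the page may need a different wait condition depending on how the site loads.

```python
from typing import Protocol


class PageLike(Protocol):
    """Subset of a Playwright-style async page used by this sketch."""

    url: str

    async def click(self, selector: str) -> None: ...
    async def wait_for_load_state(self) -> None: ...
    async def go_back(self) -> object: ...


async def capture_indirect_url(page: PageLike, selector: str) -> str:
    """Click through to the file, record its URL, then return to the listing."""
    await page.click(selector)        # 1. open the URL behind the button/link
    await page.wait_for_load_state()  # let the navigation settle
    file_url = page.url               # 2. capture the newly loaded URL
    await page.go_back()              # 3. navigate back to the original page
    return file_url
```

Navigating back at the end keeps the page in a known state, so the same routine can be repeated for the next button or link in a list.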
JavaScript/Dynamic Downloads
Dynamic downloads occur when a file download is triggered by JavaScript events directly in the browser, without a direct URL.
To handle dynamic downloads:
- Use the capture_download method to trigger and capture the download directly in the browser.
- Retrieve the file metadata (URL and title).
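The signature of capture_download is not shown on this page, so the sketch below does not call it. Instead it illustrates the general browser-side technique with Playwright's async expect_download pattern, which is an assumption about the automation layer. The capture_dynamic_download helper and the "title" key (filled from the browser's suggested filename) are hypothetical.

```python
from typing import Any


async def capture_dynamic_download(page: Any, selector: str) -> dict:
    """Trigger a JavaScript-driven download and return its metadata."""
    # Start listening before the click so the download event is not missed.
    async with page.expect_download() as download_info:
        await page.click(selector)
    download = await download_info.value
    # Assumption: the suggested filename stands in for the file title.
    return {"url": download.url, "title": download.suggested_filename}
```

Because the download is captured in the same browser context that triggered it, this pattern works even when no standalone URL for the file is exposed in the page.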
Downloads Requiring Cookies/Session
Some sites require the download to occur within the same browser session that accessed the page, which makes downloading from a separate environment such as AWS Lambda unsuitable.
In these cases:
- Follow the same approach as dynamic downloads, handling the download directly in the browser context using capture_download.