When installed as a Node module dependency, the engine exposes a JavaScript API that you can call from your own code.
extract(sourceDocument)
⇒ Promise.<string>
launchHeadlessBrowser()
⇒ Promise.<puppeteer.Browser>
stopHeadlessBrowser()
⇒ Promise.<void>
fetch(params)
⇒ Promise.<{mimeType: string, content: (string|Buffer), fetcher: string}>
new SourceDocument(params)
Represents a source document containing web content and metadata for extraction.
Includes the document location, selectors for content inclusion/exclusion,
content filters, raw content data, and MIME type information.
Param | Type | Description |
---|---|---|
params | object | The source document parameters |
params.location | string | The URL location of the document |
params.executeClientScripts | boolean | Whether to execute client-side scripts |
params.contentSelectors | string \| object \| Array | CSS selectors for content to include |
params.insignificantContentSelectors | string \| object \| Array | CSS selectors for content to exclude |
params.filters | Array | Array of filters to apply |
params.content | string | The document content |
params.mimeType | string | The MIME type of the content |
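The parameters in the table above can be assembled as a plain object before being passed to the constructor. The following is a minimal sketch: `buildSourceDocumentParams` and its validation are hypothetical helpers written for this illustration, the URL and selectors are placeholder values, and leaving `content` and `mimeType` unset until the document has been fetched is an assumption.

```javascript
// Hypothetical helper, not part of the engine: assembles the `params` object
// described in the table above. The validation is illustrative only.
function buildSourceDocumentParams({
  location,
  contentSelectors,
  insignificantContentSelectors = [],
  executeClientScripts = false,
}) {
  if (typeof location !== 'string' || location.length === 0) {
    throw new TypeError('location must be a non-empty URL string');
  }

  return {
    location,
    executeClientScripts,
    contentSelectors,
    insignificantContentSelectors,
    filters: [], // no filters in this example
  };
}

// Placeholder values for illustration.
const params = buildSourceDocumentParams({
  location: 'https://example.com/terms',
  contentSelectors: 'main',
  insignificantContentSelectors: ['nav', 'footer'],
});

// With the class in scope (the import path is not covered by this reference):
// const sourceDocument = new SourceDocument(params);
```

Building the object separately keeps the selector configuration easy to inspect or log before a `SourceDocument` is created.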
extract(sourceDocument)
⇒ Promise.<string>
Extract content from source document and convert it to Markdown
Kind: global function
Returns: Promise.<string>
- Promise that resolves, once extraction and conversion are complete, to a string containing the extracted content in Markdown format
Param | Type | Description |
---|---|---|
sourceDocument | SourceDocument | Source document from which to extract content, see SourceDocument |
launchHeadlessBrowser()
⇒ Promise.<puppeteer.Browser>
Launches a headless browser instance using Puppeteer if one is not already running. Returns the existing browser instance if one is already running, otherwise creates and returns a new instance.
Kind: global function
Returns: Promise.<puppeteer.Browser>
- The Puppeteer browser instance.
stopHeadlessBrowser()
⇒ Promise.<void>
Stops the headless browser instance if one is running. If no instance exists, it does nothing.
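Because the browser is launched and stopped explicitly, it can help to pair the two calls so that the browser is stopped even when the work in between fails. The sketch below assumes an `engine` object exposing `launchHeadlessBrowser` and `stopHeadlessBrowser` as documented above. Both `engine` and `withHeadlessBrowser` are hypothetical names used for illustration, since this reference does not show how the functions are imported.

```javascript
// Hypothetical helper, not part of the engine. `engine` stands in for an object
// exposing the launchHeadlessBrowser and stopHeadlessBrowser functions above.
async function withHeadlessBrowser(engine, task) {
  // Reuses the existing instance if one is already running (see above).
  await engine.launchHeadlessBrowser();

  try {
    return await task();
  } finally {
    // Runs whether the task succeeded or threw; documented as doing nothing
    // when no instance exists.
    await engine.stopHeadlessBrowser();
  }
}
```

A call such as `withHeadlessBrowser(engine, () => doWork())` then resolves with the result of `doWork` once the browser has been stopped.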
fetch(params)
⇒ Promise.<{mimeType: string, content: (string|Buffer), fetcher: string}>
Fetch a resource from the network, returning a promise which is fulfilled once the response is available
Kind: global function
Returns: Promise.<{mimeType: string, content: (string|Buffer), fetcher: string}>
- Promise containing the fetched resource’s MIME type, content, and fetcher type
Throws:
- FetchDocumentError When the fetch operation fails
Param | Type | Description |
---|---|---|
params | object | Fetcher parameters |
params.url | string | URL of the resource you want to fetch |
[params.executeClientScripts] | boolean | Enable execution of client scripts. When set to true, the page is loaded in a headless browser so that all assets are loaded and client scripts are executed before its content is returned. If undefined, the engine balances performance against tracking success rate: it starts without executing scripts and escalates to a headless browser if needed |
[params.cssSelectors] | string \| Array | CSS selector, or array of CSS selectors, to await when loading the resource in a headless browser. Only relevant when executeClientScripts is enabled |
[params.config] | object | Fetcher configuration |
[params.config.navigationTimeout] | number | Maximum time (in milliseconds) to wait before considering the fetch failed |
[params.config.language] | string | Language (in ISO 639-1 format) to be passed in request headers |
[params.config.waitForElementsTimeout] | number | Maximum time (in milliseconds) to wait for selectors to exist on page before considering the fetch failed. Only relevant when executeClientScripts is enabled |
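Taken together, the functions above suggest a fetch-then-extract flow: fetch the resource, wrap the result in a `SourceDocument`, and pass it to `extract`. The sketch below assumes an `engine` object exposing `fetch`, `extract` and `SourceDocument` as documented here. `engine` and `fetchAndExtract` are hypothetical names, the configuration values are illustrative, and passing only part of `config` is an assumption not confirmed by this reference.

```javascript
// Hypothetical helper, not part of the engine. `engine` stands in for an object
// exposing fetch, extract and SourceDocument as documented above.
async function fetchAndExtract(engine, { url, contentSelectors }) {
  // Rejections from fetch (documented as FetchDocumentError) propagate to the caller.
  const { mimeType, content } = await engine.fetch({
    url,
    executeClientScripts: false,
    config: {
      navigationTimeout: 30000, // illustrative value, in milliseconds
      language: 'en', // ISO 639-1 code sent in request headers
    },
  });

  // Carry the fetched content and MIME type into the source document.
  const sourceDocument = new engine.SourceDocument({
    location: url,
    contentSelectors,
    content,
    mimeType,
  });

  // Resolves to the extracted content as a Markdown string.
  return engine.extract(sourceDocument);
}
```

Passing the API in as an argument keeps the sketch independent of the import path and makes it straightforward to substitute stubs in tests.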