Node.js module

When installed as a Node.js module dependency, the engine exposes a JavaScript API that you can call from your own code.


SourceDocument

Kind: global class

new SourceDocument(params)

Represents a source document containing web content and metadata for extraction.
Includes the document location, selectors for content inclusion/exclusion,
content filters, raw content data, and MIME type information.

| Param | Type | Description |
| --- | --- | --- |
| `params` | `object` | The source document parameters |
| `params.location` | `string` | The URL location of the document |
| `params.executeClientScripts` | `boolean` | Whether to execute client-side scripts |
| `params.contentSelectors` | `string` \| `object` \| `Array` | CSS selectors for content to include |
| `params.insignificantContentSelectors` | `string` \| `object` \| `Array` | CSS selectors for content to exclude |
| `params.filters` | `Array` | Array of filters to apply |
| `params.content` | `string` | The document content |
| `params.mimeType` | `string` | The MIME type of the content |
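
As a minimal sketch of the parameter shape, the helper below assembles a `params` object using only the field names from the table above. The helper name `buildSourceDocumentParams`, its defaults, the example URL, and the selectors are all hypothetical and not part of the engine.

```javascript
// Hypothetical helper, not part of the engine: assembles the `params` object
// documented above. Field names come from the parameter table; the defaults
// (false, empty arrays) are assumptions made for this sketch.
function buildSourceDocumentParams({
  location,
  contentSelectors,
  insignificantContentSelectors = [],
  filters = [],
  executeClientScripts = false,
}) {
  if (typeof location !== 'string' || location.length === 0) {
    throw new TypeError('location must be a non-empty URL string');
  }

  return {
    location,
    executeClientScripts,
    contentSelectors,
    insignificantContentSelectors,
    filters,
  };
}

// Example values are placeholders.
const params = buildSourceDocumentParams({
  location: 'https://example.com/terms',
  contentSelectors: 'main',
  insignificantContentSelectors: ['nav', 'footer'],
});

console.log(params);
```

Once the class is imported from the engine, the resulting object would be passed as `new SourceDocument(params)`.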

extract(sourceDocument) ⇒ Promise.<string>

Extracts content from a source document and converts it to Markdown.

Kind: global function
Returns: Promise.<string> - Promise that resolves, once the content has been extracted and converted, to a string containing the extracted content in Markdown format

| Param | Type | Description |
| --- | --- | --- |
| `sourceDocument` | `SourceDocument` | Source document from which to extract content, see SourceDocument |
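
The following sketch shows one way `extract` might be called on content that is already in memory. The function name `extractFromHtml` is hypothetical, and `engine` stands for whatever object your import of the engine module yields; it is passed in as an argument so the sketch does not assume a package name. It also assumes `extract` accepts a `SourceDocument` instance, as the description above suggests.

```javascript
// Hypothetical wrapper, not part of the engine. `engine` is assumed to expose
// the `SourceDocument` class and the `extract` function documented above.
async function extractFromHtml(engine, { location, html, contentSelectors }) {
  const sourceDocument = new engine.SourceDocument({
    location,
    contentSelectors,
    content: html,
    mimeType: 'text/html',
  });

  // Resolves to a string of Markdown, per the documented return type.
  return engine.extract(sourceDocument);
}
```

A call would look like `const markdown = await extractFromHtml(engine, { location, html, contentSelectors: 'main' })`, with a rejected promise left for the caller to handle.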

launchHeadlessBrowser() ⇒ Promise.<puppeteer.Browser>

Launches a headless browser instance using Puppeteer. If an instance is already running, it is returned; otherwise a new one is created and returned.

Kind: global function
Returns: Promise.<puppeteer.Browser> - The Puppeteer browser instance.

stopHeadlessBrowser() ⇒ Promise.<void>

Stops the headless browser instance if one is running. If no instance exists, it does nothing.

Kind: global function
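
Because launching and stopping are separate calls, a caller who wants to control the browser's lifetime may want to pair them so the browser is stopped even when the work in between fails. The sketch below illustrates that pattern; `withHeadlessBrowser` is a hypothetical helper and `engine` again stands for the imported module.

```javascript
// Hypothetical helper, not part of the engine. Runs `task` between
// launchHeadlessBrowser() and stopHeadlessBrowser(), stopping the browser
// whether `task` resolves or rejects.
async function withHeadlessBrowser(engine, task) {
  const browser = await engine.launchHeadlessBrowser();

  try {
    return await task(browser);
  } finally {
    await engine.stopHeadlessBrowser();
  }
}
```

Since `launchHeadlessBrowser` returns the existing instance when one is running, nesting this helper would stop a browser that an outer caller still expects to use; the sketch assumes a single, non-nested use.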

fetch(params) ⇒ Promise.<{mimeType: string, content: (string|Buffer)}>

Fetches a resource from the network, returning a promise which is fulfilled once the response is available.

Kind: global function
Returns: Promise.<{mimeType: string, content: (string|Buffer)}> - Promise containing the fetched resource’s MIME type and content

| Param | Type | Description |
| --- | --- | --- |
| `params` | `object` | Fetcher parameters |
| `params.url` | `string` | URL of the resource you want to fetch |
| `[params.executeClientScripts]` | `boolean` | Enable execution of client scripts. When set to `true`, the page is loaded in a headless browser, which loads all assets and executes client scripts before the content is returned |
| `[params.cssSelectors]` | `string` \| `Array` | CSS selectors to await when loading the resource in a headless browser. Can be a single CSS selector or an array of CSS selectors. Only relevant when `executeClientScripts` is enabled |
| `[params.config]` | `object` | Fetcher configuration |
| `[params.config.navigationTimeout]` | `number` | Maximum time (in milliseconds) to wait before considering the fetch failed |
| `[params.config.language]` | `string` | Language (in ISO 639-1 format) to be passed in request headers |
| `[params.config.waitForElementsTimeout]` | `number` | Maximum time (in milliseconds) to wait for selectors to exist on the page before considering the fetch failed. Only relevant when `executeClientScripts` is enabled |
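
To illustrate how the fetcher parameters fit together with `SourceDocument` and `extract`, here is a sketch of a fetch-then-extract pipeline. The function name `fetchAndExtract`, the timeout values, and the `'en'` language are assumptions chosen for illustration; `engine` stands for the imported module, which is assumed to expose `fetch`, `SourceDocument`, and `extract` as documented on this page.

```javascript
// Hypothetical pipeline, not part of the engine: fetch a page with client
// scripts enabled, then convert the selected content to Markdown.
async function fetchAndExtract(engine, { url, selector }) {
  const { mimeType, content } = await engine.fetch({
    url,
    executeClientScripts: true,
    cssSelectors: [selector], // wait for this selector before reading the page
    config: {
      navigationTimeout: 30000, // assumed value, in milliseconds
      language: 'en', // ISO 639-1 code, assumed value
      waitForElementsTimeout: 10000, // assumed value, in milliseconds
    },
  });

  // Hand the fetched content and MIME type over to the extraction step.
  const sourceDocument = new engine.SourceDocument({
    location: url,
    executeClientScripts: true,
    contentSelectors: selector,
    content,
    mimeType,
  });

  return engine.extract(sourceDocument);
}
```

The same selector is used both to wait for the page and to pick the content to keep; the two could differ, for example when a late-loading element signals that the page is ready but is not itself the content of interest.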