Node.js module

When installed as a Node.js module dependency, the engine exposes a JavaScript API that you can call from your own code.

SourceDocument

Kind: global class

new SourceDocument(params)

Represents a source document containing web content and metadata for extraction.
Includes the document location, selectors for content inclusion/exclusion,
content filters, raw content data, and MIME type information.

| Param | Type | Description |
| --- | --- | --- |
| params | object | The source document parameters |
| params.location | string | The URL location of the document |
| params.executeClientScripts | boolean | Whether to execute client-side scripts |
| params.contentSelectors | string \| object \| Array | CSS selectors for content to include |
| params.insignificantContentSelectors | string \| object \| Array | CSS selectors for content to exclude |
| params.filters | Array | Array of filters to apply |
| params.content | string | The document content |
| params.mimeType | string | The MIME type of the content |
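
To make the parameter list above concrete, here is a minimal sketch of building a `params` object and passing it to the constructor. The URL, selectors, and helper names (`buildSourceDocumentParams`, `createSourceDocument`) are illustrative assumptions, and the `SourceDocument` class is passed in as an argument because this section does not state the package's import specifier.

```javascript
// Hypothetical helper: builds the `params` object documented above.
// All values are illustrative, not defaults of the engine.
function buildSourceDocumentParams(location, html) {
  return {
    location, // URL of the document
    executeClientScripts: false, // static page, no headless browser needed
    contentSelectors: ['main'], // content to include
    insignificantContentSelectors: ['nav', 'footer'], // content to exclude
    filters: [], // no filters applied in this sketch
    content: html, // raw document content
    mimeType: 'text/html',
  };
}

// Hypothetical helper: `SourceDocument` is the class exposed by the engine,
// injected by the caller since the import specifier is not given here.
function createSourceDocument(SourceDocument, location, html) {
  return new SourceDocument(buildSourceDocumentParams(location, html));
}
```

Passing the class in explicitly keeps the sketch independent of how the module is imported in your project.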

extract(sourceDocument)Promise.<string>

Extracts content from the source document and converts it to Markdown.

Kind: global function
Returns: Promise.<string> - Promise that resolves to a string containing the extracted content in Markdown format

| Param | Type | Description |
| --- | --- | --- |
| sourceDocument | SourceDocument | Source document from which to extract content, see SourceDocument |
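
As a rough illustration of how `extract` might be combined with the `fetch` function documented later in this reference, the sketch below fetches a page, wraps the result in a `SourceDocument`, and converts it to Markdown. It assumes, without confirmation from this section, that `fetch`, `extract`, and `SourceDocument` are all available on one `engine` object; `fetchAsMarkdown` is a hypothetical helper name.

```javascript
// Hypothetical helper. `engine` is assumed to expose `fetch`, `extract`
// and `SourceDocument`; how you obtain it depends on your import setup.
async function fetchAsMarkdown(engine, url, contentSelectors) {
  // Rename to avoid confusion with the global `fetch` of recent Node.js versions
  const { fetch: engineFetch, extract, SourceDocument } = engine;

  // Retrieve the raw content and its MIME type
  const { mimeType, content } = await engineFetch({ url });

  // Describe what to extract from the fetched content
  const sourceDocument = new SourceDocument({
    location: url,
    contentSelectors,
    content,
    mimeType,
  });

  // Resolves to a Markdown string
  return extract(sourceDocument);
}
```

A call could look like `await fetchAsMarkdown(engine, 'https://example.com/terms', 'main')`, where the URL and selector are placeholders.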

launchHeadlessBrowser()Promise.<puppeteer.Browser>

Launches a headless browser instance using Puppeteer if one is not already running. If an instance is already running, returns it instead of creating a new one.

Kind: global function
Returns: Promise.<puppeteer.Browser> - The Puppeteer browser instance.

stopHeadlessBrowser()Promise.<void>

Stops the headless browser instance if one is running. If no instance exists, it does nothing.

Kind: global function
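
Since the browser is started and stopped explicitly, one possible pattern is to wrap the work in `try`/`finally` so the browser is stopped even when the work fails. The sketch below assumes both functions are available on an `engine` object; `withHeadlessBrowser` and `task` are hypothetical names.

```javascript
// Hypothetical helper. `engine` is assumed to expose
// `launchHeadlessBrowser` and `stopHeadlessBrowser`.
async function withHeadlessBrowser(engine, task) {
  // Reuses the running instance if there is one, as documented above
  await engine.launchHeadlessBrowser();

  try {
    // Run the caller's work while the browser is available
    return await task();
  } finally {
    // Runs whether `task` succeeded or threw
    await engine.stopHeadlessBrowser();
  }
}
```

Because `launchHeadlessBrowser` returns an already running instance, this wrapper would also stop a browser that other code started, so it is best used at the outermost level of a program.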

fetch(params)Promise.<{mimeType: string, content: (string|Buffer), fetcher: string}>

Fetches a resource from the network, returning a promise that is fulfilled once the response is available.

Kind: global function
Returns: Promise.<{mimeType: string, content: (string|Buffer), fetcher: string}> - Promise containing the fetched resource’s MIME type, content, and fetcher type
Throws:

  • FetchDocumentError - When the fetch operation fails

| Param | Type | Description |
| --- | --- | --- |
| params | object | Fetcher parameters |
| params.url | string | URL of the resource to fetch |
| [params.executeClientScripts] | boolean | Enable execution of client scripts. When set to true, the page is loaded in a headless browser, which loads all assets and executes client scripts before returning the content. If undefined, the engine automatically balances performance and tracking success rate: it first fetches without executing scripts, then escalates to a headless browser if needed |
| [params.cssSelectors] | string \| Array | A CSS selector, or an array of CSS selectors, to wait for when loading the resource in a headless browser. Only relevant when executeClientScripts is enabled |
| [params.config] | object | Fetcher configuration |
| [params.config.navigationTimeout] | number | Maximum time (in milliseconds) to wait before considering the fetch failed |
| [params.config.language] | string | Language (in ISO 639-1 format) to be passed in request headers |
| [params.config.waitForElementsTimeout] | number | Maximum time (in milliseconds) to wait for the selectors to exist on the page before considering the fetch failed. Only relevant when executeClientScripts is enabled |
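
The options above can be combined as in the following sketch, which requests a page through the headless browser and handles a failed fetch. The timeout values, language, and selector are illustrative. Identifying the error through `error.name === 'FetchDocumentError'` is an assumption, since this section does not say how the error class is exposed. `fetchRenderedPage` is a hypothetical helper, and `engineFetch` stands for the engine's `fetch` function.

```javascript
// Hypothetical helper. `engineFetch` is the engine's `fetch` function,
// passed in by the caller.
async function fetchRenderedPage(engineFetch, url) {
  try {
    return await engineFetch({
      url,
      executeClientScripts: true, // load the page in a headless browser
      cssSelectors: ['main'], // wait for this element before returning
      config: {
        navigationTimeout: 30000, // milliseconds, illustrative value
        language: 'en', // ISO 639-1 code sent in request headers
        waitForElementsTimeout: 10000, // milliseconds, illustrative value
      },
    });
  } catch (error) {
    // Assumption: the thrown error carries its class name in `error.name`
    if (error && error.name === 'FetchDocumentError') {
      return null; // signal a failed fetch to the caller
    }

    throw error; // unrelated errors are not swallowed
  }
}
```

Returning `null` on a fetch failure is only one possible design. Depending on the caller, logging and rethrowing may be more appropriate.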