When installed as a Node module dependency, the engine exposes a JavaScript API that you can call from your own code.
- extract(sourceDocument) ⇒ Promise.<string>
- launchHeadlessBrowser() ⇒ Promise.<puppeteer.Browser>
- stopHeadlessBrowser() ⇒ Promise.<void>
- fetch(params) ⇒ Promise.<{mimeType: string, content: (string|Buffer)}>
new SourceDocument(params)
Represents a source document containing web content and metadata for extraction.
Includes the document location, selectors for content inclusion/exclusion,
content filters, raw content data, and MIME type information.
| Param | Type | Description |
|---|---|---|
| params | object | The source document parameters |
| params.location | string | The URL location of the document |
| params.executeClientScripts | boolean | Whether to execute client-side scripts |
| params.contentSelectors | string \| object \| Array | CSS selectors for content to include |
| params.insignificantContentSelectors | string \| object \| Array | CSS selectors for content to exclude |
| params.filters | Array | Array of filters to apply |
| params.content | string | The document content |
| params.mimeType | string | The MIME type of the content |
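As a rough illustration of how these parameters fit together, the sketch below assembles a `params` object from a URL and from a `{ mimeType, content }` pair such as the one `fetch` resolves to. The helper name `buildSourceDocumentParams`, the example URL, and the selector values are hypothetical. The `SourceDocument` class itself is left out because its import path is not given here.

```javascript
// Hypothetical helper: assembles the `params` object described in the table above.
// Only the property names come from the reference; the values are illustrative.
function buildSourceDocumentParams(location, fetched = {}) {
  return {
    location,                                          // URL of the document
    executeClientScripts: false,                       // skip client-side script execution
    contentSelectors: 'main',                          // a single CSS selector given as a string
    insignificantContentSelectors: ['nav', 'footer'],  // several selectors given as an array
    filters: [],                                       // no filters in this sketch
    content: fetched.content,                          // raw content, if already available
    mimeType: fetched.mimeType,                        // MIME type reported for that content
  };
}

const params = buildSourceDocumentParams('https://example.com/terms', {
  mimeType: 'text/html',
  content: '<main><h1>Terms</h1></main>',
});
```

With the engine available, this object would presumably be passed to the constructor as `new SourceDocument(params)`.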
extract(sourceDocument)
⇒ Promise.<string>
Extract content from a source document and convert it to Markdown.
Kind: global function
Returns: Promise.<string>
- Promise fulfilled once the content has been extracted and converted to Markdown. It resolves to a string containing the extracted content in Markdown format
| Param | Type | Description |
|---|---|---|
| sourceDocument | SourceDocument | Source document from which to extract content, see SourceDocument |
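The following minimal sketch shows one way the constructor and `extract` might be combined. It assumes an `engine` object that exposes the `SourceDocument` class and the `extract` function described above. How these are imported is not covered here, and `extractMarkdown` is a hypothetical helper name.

```javascript
// Assumption: `engine` is an object exposing the `SourceDocument` class and the
// `extract` function documented above; the import path is not specified here.
async function extractMarkdown(engine, params) {
  // Wrap the raw parameters in a SourceDocument instance.
  const sourceDocument = new engine.SourceDocument(params);

  // `extract` returns a promise that resolves to a Markdown string.
  return engine.extract(sourceDocument);
}
```

A call could then look like `const markdown = await extractMarkdown(engine, { location, contentSelectors: 'main', content, mimeType });`.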
launchHeadlessBrowser()
⇒ Promise.<puppeteer.Browser>
Launches a headless browser instance using Puppeteer and returns it. If an instance is already running, that instance is returned instead of a new one being created.
Kind: global function
Returns: Promise.<puppeteer.Browser>
- The Puppeteer browser instance.
stopHeadlessBrowser()
⇒ Promise.<void>
Stops the headless browser instance if one is running. If no instance exists, it does nothing.
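Because launching and stopping come as a pair, one possible pattern is to wrap the work in `try`/`finally` so the browser is stopped even when the work fails. This is only a sketch: `withHeadlessBrowser` and `task` are hypothetical names, and `engine` is assumed to be an object exposing the two functions described above.

```javascript
// Assumption: `engine` exposes `launchHeadlessBrowser` and `stopHeadlessBrowser`
// as documented above. `withHeadlessBrowser` and `task` are hypothetical names.
async function withHeadlessBrowser(engine, task) {
  // Returns the running instance if there is one, otherwise starts a new one.
  const browser = await engine.launchHeadlessBrowser();

  try {
    return await task(browser);
  } finally {
    // Runs whether `task` succeeded or threw; does nothing if no instance exists.
    await engine.stopHeadlessBrowser();
  }
}
```

Note that this stops the browser unconditionally, so it would also stop an instance that was already running before the helper was called.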
fetch(params)
⇒ Promise.<{mimeType: string, content: (string|Buffer)}>
Fetch a resource from the network, returning a promise which is fulfilled once the response is available.
Kind: global function
Returns: Promise.<{mimeType: string, content: (string|Buffer)}>
- Promise containing the fetched resource’s MIME type and content
| Param | Type | Description |
|---|---|---|
| params | object | Fetcher parameters |
| params.url | string | URL of the resource you want to fetch |
| [params.executeClientScripts] | boolean | Enable execution of client scripts. When set to true, this property loads the page in a headless browser to load all assets and execute client scripts before returning its content |
| [params.cssSelectors] | string \| Array | List of CSS selectors to await when loading the resource in a headless browser. Can be a CSS selector or an array of CSS selectors. Only relevant when executeClientScripts is enabled |
| [params.config] | object | Fetcher configuration |
| [params.config.navigationTimeout] | number | Maximum time (in milliseconds) to wait before considering the fetch failed |
| [params.config.language] | string | Language (in ISO 639-1 format) to be passed in request headers |
| [params.config.waitForElementsTimeout] | number | Maximum time (in milliseconds) to wait for selectors to exist on page before considering the fetch failed. Only relevant when executeClientScripts is enabled |
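To make the parameter table more concrete, here is a hedged sketch of a call that enables client script execution. The property names come from the table above. The helper name `fetchRenderedPage`, the selectors, and the timeout values are illustrative assumptions. The engine's `fetch` is passed in as `engineFetch` to avoid confusion with the global `fetch` available in recent Node versions.

```javascript
// Assumption: `engineFetch` is the engine's `fetch` function documented above.
// Selector and timeout values are illustrative, not recommended defaults.
function fetchRenderedPage(engineFetch, url) {
  return engineFetch({
    url,
    executeClientScripts: true,        // load the page in a headless browser
    cssSelectors: ['main', 'footer'],  // elements to wait for before returning
    config: {
      navigationTimeout: 30000,        // milliseconds
      language: 'en',                  // ISO 639-1 code sent in request headers
      waitForElementsTimeout: 10000,   // milliseconds
    },
  });
}
```

The result could be consumed as `const { mimeType, content } = await fetchRenderedPage(engineFetch, 'https://example.com/terms');`. The reference does not say whether `launchHeadlessBrowser()` has to be called before fetching with `executeClientScripts` enabled, so this sketch leaves that question open.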