Extract Data from Multiple Websites with Web Scraper

The Web Scraper connector lets you automatically collect and index content from one or more URLs. It’s ideal for gathering information across multiple websites. Once set up, the Web Scraper updates every 24 hours to ensure your data stays fresh.

How It Works (Step-by-Step)

Step 1: Open Integrations Settings

  • Navigate to Settings > Integrations
  • Click + Connect new integration
  • Search for Website (Next-Generation Scraper)

Step 2: Start the setup

  • Click Add Item

  • Enter the URLs of the websites you want to scrape
  • To add more URLs, click Add Item again and enter additional ones

Step 3: (Optional) Enable deep crawling

  • Check the box for Enable Deep Crawling
    • Enabled: Collects data from the main page and all child pages linked from it
    • Disabled: Collects data only from the initial URL

💡 Deep crawling is ideal for websites with nested navigation or multi-level documentation.

Step 4: Save the configuration

  • Click Connect to save the configuration
  • The Web Scraper will begin indexing and will reindex automatically every 24 hours

Scraping Authenticated Pages

By default, Web Scraper pulls data from publicly accessible URLs. If you need to scrape content behind a login or authentication wall, please reach out to your Forethought Success Manager.

Web Scraper Capabilities

Capability | Limit
Maximum number of URLs a single Forethought user can add | No limit
Maximum depth the web scraper will drill down for each added URL (deep crawling enabled) | No limit
Maximum depth the web scraper will drill down for each added URL (deep crawling disabled) | 1
Maximum number of pages each URL can scan | No limit
Total maximum number of pages that one Forethought account can scan | No limit

Q&A

What happens when you disable deep crawling?

The scraper collects data only from the initial URL. It does not follow links to child pages.

How many URLs can you add manually?

There is no limit on the number of URLs you can add to the next-generation scraper.

What’s the maximum number of URLs it can index?

There's no maximum number.

What does depth mean in web scraping?

Depth in web scraping refers to how many levels deep the scraper will navigate through a website's structure. For example, if the depth is 3, the scraper will perform the following steps:

  • First Level: It starts at the main URL you provide.
  • Second Level: From the first page, it collects links to other pages and scrapes data from those linked pages.
  • Third Level: It then takes links from the second-level pages and scrapes data from those pages as well.
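
The levels described above can be illustrated with a short Python sketch of a depth-limited, breadth-first crawl. This is not Forethought's implementation: the `crawl` function, the `SITE` link map, and the example.com URLs are all hypothetical, and an in-memory dictionary stands in for real page fetching and link extraction.

```python
from collections import deque


def crawl(start_url, links, max_depth):
    """Return {url: level} for every page reached within max_depth levels.

    `links` is a hypothetical in-memory map of page -> linked pages,
    standing in for real HTTP fetches and link extraction.
    Level 1 is the starting URL, matching the numbering in the list above.
    """
    seen = {start_url: 1}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        level = seen[url]
        if level >= max_depth:
            continue  # do not follow links past the depth limit
        for child in links.get(url, []):
            if child not in seen:  # visit each page at most once
                seen[child] = level + 1
                queue.append(child)
    return seen


# Hypothetical site structure, for illustration only.
SITE = {
    "https://example.com/docs": [
        "https://example.com/docs/setup",
        "https://example.com/docs/faq",
    ],
    "https://example.com/docs/setup": [
        "https://example.com/docs/setup/advanced",
    ],
    "https://example.com/docs/setup/advanced": [
        "https://example.com/docs/setup/advanced/tuning",
    ],
}

# Depth 1: only the starting URL is collected.
shallow = crawl("https://example.com/docs", SITE, max_depth=1)

# Depth 3: the starting URL, its links, and the links found on those pages.
deep = crawl("https://example.com/docs", SITE, max_depth=3)
```

In this sketch, a depth of 1 returns only the starting URL, which mirrors the behavior described for disabled deep crawling. A depth of 3 reaches the third-level page but stops before the fourth-level "tuning" page.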