The Web Scraper connector lets you automatically collect and index content from one or more URLs. It’s ideal for gathering information across multiple websites. Once set up, the Web Scraper updates every 24 hours to ensure your data stays fresh.
How It Works (Step-by-Step)
Step 1: Open Integrations Settings
- Navigate to Settings > Integrations
- Click + Connect new integration
- Search for Website (Next-Generation Scraper)
Step 2: Start the setup
- Click Add Item
- Enter the URLs of the websites you want to scrape
- To add more URLs, click Add Item again and enter additional ones
Step 3: (Optional) Enable deep crawling
- Check the box for Enable Deep Crawling
- Enabled: Collects data from the main page and all child pages linked from it
- Disabled: Collects data only from the initial URL
💡 Deep crawling is ideal for websites with nested navigation or multi-level documentation; the sketch below illustrates the difference between the two modes.
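Forethought has not published how the scraper works internally, but the practical difference between the two settings can be sketched with a small, hypothetical Python crawler. The `scrape` function, the `deep_crawl` flag, and the use of requests/BeautifulSoup below are illustrative assumptions, not the product's actual code:

```python
# Hypothetical sketch of single-page scraping vs. deep crawling.
# This is NOT Forethought's implementation -- it only illustrates the setting.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def scrape(seed_url, deep_crawl=False):
    """Return {url: page_text} for the seed page and, when deep_crawl
    is True, for every same-site child page linked from it."""
    pages = {}
    soup = BeautifulSoup(requests.get(seed_url, timeout=10).text, "html.parser")
    pages[seed_url] = soup.get_text(" ", strip=True)

    if not deep_crawl:
        return pages  # Deep crawling disabled: only the initial URL is collected.

    # Deep crawling enabled: also collect the child pages linked from the main page.
    for link in soup.find_all("a", href=True):
        child = urljoin(seed_url, link["href"])
        same_site = urlparse(child).netloc == urlparse(seed_url).netloc
        if same_site and child not in pages:
            child_soup = BeautifulSoup(requests.get(child, timeout=10).text, "html.parser")
            pages[child] = child_soup.get_text(" ", strip=True)
    return pages
```

Calling `scrape(url)` mirrors the disabled setting (only the initial page is collected), while `scrape(url, deep_crawl=True)` mirrors the enabled setting (the main page plus the child pages linked from it).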
Step 4: Save the configuration
- Click Connect to save the configuration
- The Web Scraper will begin indexing and will reindex automatically every 24 hours
Scraping Authenticated Pages
By default, Web Scraper pulls data from publicly accessible URLs. If you need to scrape content behind a login or authentication wall, please reach out to your Forethought Success Manager.
Web Scraper Capabilities
| Capabilities | Limits |
| --- | --- |
| Maximum number of URLs a single Forethought user can add | No limit |
| Maximum depth the web scraper will drill down for each added URL (deep crawling enabled) | No limit |
| Maximum depth the web scraper will drill down for each added URL (deep crawling disabled) | 1 |
| Maximum number of pages that can be scanned for each added URL | No limit |
| Total number of pages that one Forethought account can scan | No limit |
Q&A
What happens if I don't enable deep crawling?
It will only scrape data on the initial page. It won't crawl deeper into linked child pages.
How many URLs can I add?
There is no limit on the number of URLs you can add with the next-gen scraper.
Is there a maximum number of pages the scraper can scan?
There's no maximum number.
What does depth mean in web scraping?
Depth in web scraping refers to how many levels deep the scraper will navigate through a website's structure. For example, if the depth is 3, the scraper will perform the following steps, sketched in code after this list:
- First Level: It starts at the main URL you provide.
- Second Level: From the first page, it collects links to other pages and scrapes data from those linked pages.
- Third Level: It then takes links from the second-level pages and scrapes data from those pages as well.
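As a rough illustration only (Forethought has not published its crawler internals), a depth-limited crawl can be sketched as a breadth-first traversal in Python. The `crawl` function and `max_depth` parameter below are hypothetical names, not product settings:

```python
# Hypothetical breadth-first crawl limited to max_depth levels.
# Illustrates the "depth" concept only -- not Forethought's actual scraper.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_depth=3):
    """Visit pages level by level: level 1 is start_url, level 2 the pages it
    links to, level 3 the pages those link to, stopping at max_depth."""
    visited = []
    seen = {start_url}
    queue = deque([(start_url, 1)])  # (url, level)

    while queue:
        url, level = queue.popleft()
        visited.append(url)          # scrape/index this page here
        if level == max_depth:
            continue                 # don't follow links past the configured depth
        html = requests.get(url, timeout=10).text
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            child = urljoin(url, link["href"])
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
    return visited
```

With max_depth=1 this visits only the initial URL (the same behavior as leaving deep crawling disabled); with max_depth=3 it covers the three levels described above.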