Extract data from a wide range of websites with Solve's Web Scraper. This article walks you through the setup process, explains the tool's capabilities, and answers frequently asked questions. As a Solve user, you can use this tool to meet your data extraction needs and enhance your chatbot’s capabilities.
What is a Web Scraper?
A Web Scraper is a tool that extracts data from URLs you specify. It automates the collection of information from websites, which saves you time and effort.
Setting up Solve’s Web Scraper
- Click on the Settings option located on the right side of your dashboard.
- Click on Integrations
- In the search bar, type “Website” to locate the Website (Next-Generation Scraper).
- Click on Website (Next-Generation Scraper)
- Click on Connect to initiate the setup
- Click Edit Settings
- Click Add Item
- Input the URL you want to scrape
- You can add multiple URLs by clicking on Add Item again and inputting additional URLs
- Optionally, check the box next to “Enable Deep Crawling” if you want to scrape more data. When this option is enabled, the Web Scraper scrapes not only the page you entered but also all of its "child" pages, meaning any linked pages whose URLs start with the same pattern as the original URL. Crawl depth is unlimited, so the scraper can reach data across multiple layers of linked pages, which makes this option well suited to comprehensive data extraction. When the option is disabled, the Web Scraper scrapes only the initial page and goes no deeper.
- After you have finished adding your URLs, click Save. The Web Scraper automatically reindexes your URLs every 24 hours to keep the content up to date.
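To make the "child" page rule from the Enable Deep Crawling step more concrete, here is a minimal Python sketch. It assumes that "starting with the same pattern" means a plain prefix match on the URL you entered; the function name `is_child_page` and the example.com URLs are hypothetical and are not part of Solve's product or API.

```python
def is_child_page(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url sits underneath seed_url.

    Assumption: a "child" page is any page whose URL begins with the
    seed URL followed by a slash. The real scraper may match differently.
    """
    prefix = seed_url.rstrip("/")
    candidate = candidate_url.rstrip("/")
    # The seed page itself is not its own child.
    return candidate != prefix and candidate.startswith(prefix + "/")


# Hypothetical links found on the seed page.
seed = "https://example.com/docs"
found_links = [
    "https://example.com/docs/setup",
    "https://example.com/docs/setup/advanced",
    "https://example.com/blog/news",
]

# Only links under the seed URL would be followed during deep crawling.
children = [link for link in found_links if is_child_page(seed, link)]
print(children)
```

Under this assumption, adding a more specific URL (for example a single help-center category) narrows what deep crawling picks up, while adding a site's root URL broadens it.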
Capabilities of Web Scraper
- Total Limit: Maximum number of URLs a single Forethought user can add
- Single URL Limit: Maximum depth the Web Scraper will drill down for every added URL (when deep crawling is enabled)
- Single URL Limit: Maximum depth the Web Scraper will drill down for every added URL (when deep crawling is disabled)
- Single URL Limit: Maximum number of pages each URL can scan
- Total Limit: Total maximum number of pages that one Forethought account can scan
If you'd like to view all the URLs that are crawled, please reach out to your Customer Success Manager.
Frequently Asked Questions (FAQs)
What happens when you disable deep crawling?
- The Web Scraper will scrape only the initial page. It won’t crawl deeper into linked child pages.
Does it show all the scanned URLs?
- No, it won’t display the scanned URLs. Instead, it shows only the number of pages that have been scanned. You can find this count on the integration page of the next-gen scraper.
How many URLs can you add manually?
- The next-gen scraper places no limit on the number of URLs you can add, while the old scraper has a limit of 10.
What’s the maximum number of URLs it can index?
- There's no maximum number, which allows for extensive data extraction.
What does depth mean in Web Scraping?
- Depth refers to how many layers of linked pages the web scraper will follow from the URL you added. For instance, in Solve Widget, the depth level is unlimited, which means the scraper can go as deep into a website's structure as necessary to gather data.
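As a rough illustration of depth, the Python sketch below walks a small in-memory link map breadth-first and records how many links away each page is from the starting page. The `SITE` map, the `crawl_depths` function, and the `max_depth` parameter are hypothetical and only illustrate the concept; they do not describe how Solve's scraper is implemented.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical site structure: each page maps to the pages it links to.
SITE: Dict[str, List[str]] = {
    "/help": ["/help/a", "/help/b"],
    "/help/a": ["/help/a/1"],
    "/help/a/1": ["/help/a/1/x"],
}


def crawl_depths(
    links: Dict[str, List[str]], start: str, max_depth: Optional[int] = None
) -> Dict[str, int]:
    """Return each reachable page with its depth (number of links from start).

    max_depth=None means unlimited depth; max_depth=0 visits only the start page.
    """
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        # Stop expanding once the depth limit is reached.
        if max_depth is not None and depths[page] >= max_depth:
            continue
        for child in links.get(page, []):
            if child not in depths:  # skip pages already seen
                depths[child] = depths[page] + 1
                queue.append(child)
    return depths


unlimited = crawl_depths(SITE, "/help")                  # like deep crawling enabled
initial_only = crawl_depths(SITE, "/help", max_depth=0)  # like deep crawling disabled
print(unlimited)
print(initial_only)
```

In this toy model, unlimited depth reaches every page in the map, while a depth of zero returns only the page you started from.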
Can you manually remove the URLs that the web scraper scanned?
- No. You can’t manually remove the URLs that the web scraper scanned because it does not display the scanned URLs.