The Google Custom Search JSON API allows you to programmatically search the web using Google. It returns results similar to (but slightly different from) a typical Google search in the browser, in JSON format.
Pricing
The Google Search API is free for up to 100 search queries per day. Additional queries cost $5 per 1,000 queries, up to 10,000 queries per day. A billing account must be set up if the free quota is exceeded.
API key and programmable search engine
We need two things to use the Google Search API:
an API key, which is a way to identify your client to Google.
a programmable search engine and its associated CX key, which is a way to identify your custom search engine to the API. Multiple custom search engines can be created, each with a different configuration.
Example usage
To use the API, one should send an HTTP GET request to the following address with one’s parameters:
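https://www.googleapis.com/customsearch/v1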
Three parameters are necessary: the API key, the CX key, and the query. A typical example in Python is the following:
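This is a minimal sketch using the requests library; the key and CX values are placeholders you would replace with your own:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your Google API key
CX_KEY = "YOUR_CX_KEY"    # placeholder: your programmable search engine ID
query = "Why always me?"

# Send a GET request to the Custom Search JSON API with the three required
# parameters (plus `num`, the number of results to return, at most 10).
results = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX_KEY, "q": query, "num": 5},
)
print(results.json())
```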
Returned JSON
For the query “Why always me?” with 5 results requested, the JSON returned for me was:
The retrieved information lies within items, and one can retrieve each result’s URL with results.json()['items'][item_index]['link'].
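For instance, a short sketch (assuming the `results` response object from the earlier snippet) that collects every result URL:

```python
# Collect the URL of each returned item
# (assumes `results` from the search request above).
links = [item["link"] for item in results.json()["items"]]
print(links)
```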
Web scraping result URLs
To further analyze the content of the results, one can use web scraping tools to extract the source HTML from the retrieved sites. In Python, for static sites, the BeautifulSoup library is good enough. However, for dynamic sites (which are the majority of sites), we may also need the Selenium library to first run the hidden JavaScript in a headless browser; otherwise you might get near-empty responses or something like “Please enable JS and disable any ad blocker”. A sample code snippet to retrieve the source HTML of static sites is:
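This is a minimal sketch with requests and BeautifulSoup; the URL below is a placeholder for one of the result links:

```python
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder: one of the result URLs

# Fetch the raw HTML and parse it with BeautifulSoup.
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

print(soup.prettify())  # parsed HTML
print(soup.get_text())  # visible text only
```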
A sample code snippet for dynamic sites is:
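This sketch uses Selenium with headless Chrome; the exact options and driver setup may vary with your Selenium and Chrome versions, and the URL is again a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://example.com"  # placeholder: one of the result URLs

# Options for running Chrome without a display (see the note below).
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=chrome_options)
driver.get(url)            # loads the page and runs its JavaScript
html = driver.page_source  # HTML after JavaScript execution
driver.quit()

print(html)
```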
Here, the above chrome_options arguments are only needed when using Selenium in a server environment without a display, such as Google Colab. Moreover, one might need to install the necessary packages as follows:
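One possible setup on a Debian/Ubuntu machine such as a Colab runtime; the exact package names are an assumption and may differ by environment:

```bash
pip install requests beautifulsoup4 selenium
apt-get install -y chromium-chromedriver
```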
Even with Selenium, there might be another problem: some servers have anti-bot detection and block your access. The undetected-chromedriver library is able to solve this problem. However, at the time of this post, even after many tries I was not able to use this library on Google Colab due to the following error:
WebDriverException: Message: unknown error: cannot connect to chrome at 127.0.0.1:45613
from chrome not reachable
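For reference, a minimal usage sketch of undetected-chromedriver looks something like the following (the URL is a placeholder):

```python
import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("--headless")

# undetected-chromedriver patches ChromeDriver to evade common bot-detection checks.
driver = uc.Chrome(options=options)
driver.get("https://example.com")  # placeholder: a site that blocks plain Selenium
html = driver.page_source
driver.quit()
```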