How to parse an HTML page with JavaScript in Python 3?

Question

How can I parse an HTML page that relies on JavaScript from Python 3, and what is needed for this?

English javascript python python-3.x

Author: MaxU, 2017-11-26

1 answer

score 15 · Accepted Answer

To extract static data from HTML or from JavaScript source text, you can use suitable parsers such as BeautifulSoup (for HTML) and slimit (for JavaScript). Example: How can I use Beautiful Soup to search for a keyword if this word is in the script tag?
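As a rough illustration of the static case, the sketch below looks for a script tag that mentions a keyword and reads the data assigned in it. The page content, the variable name `settings`, and the regular expressions are made up for this example. It also assumes the JavaScript object literal happens to be valid JSON; arbitrary JavaScript would need a real JavaScript parser such as slimit.

```python
import json
import re

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Hypothetical page: the data of interest sits inside a <script> tag.
html = """
<html><body>
<script>var settings = {"user": "alice", "items": [1, 2, 3]};</script>
<script>console.log("unrelated");</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the first <script> whose text contains the keyword.
script = soup.find("script", string=re.compile(r"var settings"))

# Cut the object literal out of the script text and parse it as JSON.
# This only works because the literal in this sample is valid JSON.
match = re.search(r"var settings\s*=\s*(\{.*?\});", script.string, re.DOTALL)
settings = json.loads(match.group(1))

print(settings["items"])
```

No browser is involved here: the script is never executed, its source text is only searched.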

To get information from a web page whose elements are generated dynamically by JavaScript, you can use a web browser. Selenium WebDriver lets you control different browsers from Python: example showing the GUI. There are other libraries as well, for example marionette (Firefox) and pyppeteer (Chrome, a Python port of the puppeteer API): example of getting a screenshot of a web page using these libraries. To get an HTML page without showing the GUI, you can run Google Chrome in headless mode and control it with selenium:

from selenium import webdriver  # $ pip install selenium

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run Chrome without opening a window
# get chromedriver from
# https://sites.google.com/a/chromium.org/chromedriver/downloads
browser = webdriver.Chrome(options=options)

browser.get('https://ru.stackoverflow.com/q/749943')
# ... other actions
generated_html = browser.page_source  # HTML after JavaScript has run
browser.quit()

This interface lets you automate user actions (pressing keys, clicking buttons, searching for elements on the page by various criteria, etc.). It is useful to split the work into two parts: first download the dynamically generated content over the network using the browser and save it (it may contain redundant information), and then analyze the now static content in detail to extract only the parts you need (possibly offline, in another process, using the same BeautifulSoup). For example, to find links to similar questions on a saved page:

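from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

soup = BeautifulSoup(generated_html, 'html.parser')
h = soup.find(id='h-related')  # element with id="h-related"
# collect the href of every <a class="question-hyperlink"> inside it
related = [a['href'] for a in h.find_all('a', class_='question-hyperlink')]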
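The two-part split described above might be sketched as follows. The HTML string stands in for `browser.page_source`, and its layout (a container with id `h-related` holding the links), the file name, and the helper names `save_page` and `extract_related` are assumptions made for this illustration.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

# Stand-in for browser.page_source; the layout is hypothetical.
generated_html = """
<div id="h-related">
  <a class="question-hyperlink" href="/q/1">First</a>
  <a class="question-hyperlink" href="/q/2">Second</a>
  <a class="other" href="/x">Not a question</a>
</div>
"""


def save_page(html, path):
    """Step 1: store the downloaded page as-is, redundant parts included."""
    Path(path).write_text(html, encoding="utf-8")


def extract_related(path):
    """Step 2: parse the saved file offline and keep only the needed links."""
    soup = BeautifulSoup(Path(path).read_text(encoding="utf-8"), "html.parser")
    container = soup.find(id="h-related")
    if container is None:  # page layout differs from what was expected
        return []
    return [a["href"] for a in container.find_all("a", class_="question-hyperlink")]


with tempfile.TemporaryDirectory() as tmp:
    saved = Path(tmp) / "page.html"
    save_page(generated_html, saved)
    links = extract_related(saved)

print(links)
```

Because the second step reads only from disk, the extraction logic can be rerun and adjusted without starting the browser or touching the network again.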

If the site provides an API (either an official one, or one discovered by inspecting the network requests that the page's JavaScript makes: example for fifa.com), using it may be preferable to pulling information out of the UI elements of a web page: example, using the Stack Exchange API.

Sites often expose a REST API or a GraphQL API, which is convenient to use with requests or with specialized libraries (see the links for code examples for the GitHub API).
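As a minimal sketch of the API route, the code below queries the Stack Exchange API for questions related to a given one. The helper names and the sample payload are made up for illustration. The endpoint path and the `site` value are given as I understand the public API, so check them against its documentation before relying on them.

```python
API_ROOT = "https://api.stackexchange.com/2.3"


def related_request(question_id, site="ru.stackoverflow"):
    """Return (url, params) for the 'related questions' endpoint."""
    url = f"{API_ROOT}/questions/{question_id}/related"
    params = {"site": site, "pagesize": 5}
    return url, params


def fetch_related(question_id, site="ru.stackoverflow"):
    """Perform the HTTP request (needs network access and requests)."""
    import requests  # $ pip install requests; imported lazily on purpose

    url, params = related_request(question_id, site)
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()


def related_links(payload):
    """Pull the question links out of a decoded API response."""
    return [item["link"] for item in payload.get("items", [])]


# Hand-written sample in the general shape of a response, not real data.
sample = {
    "items": [{"title": "Example question", "link": "https://example.com/q/1"}],
    "has_more": False,
}
print(related_links(sample))
```

With network access, `related_links(fetch_related(749943))` would be the actual call. Compared with scraping, the response is already structured JSON, so no HTML parsing or browser is needed.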