Lately I've found the vocabulary.com site, which offers a lot of useful knowledge. Let's see how we can automate reading new vocabulary by scraping data from the site into a PDF.

To The Point

Scraping data from vocabulary.com

First, we need to gather the article links from vocabulary.com's Choose Your Words section.

A script like this should do it:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

domain_href = 'https://www.vocabulary.com/'
output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
bsobj = BeautifulSoup(output.text, 'html.parser')
distinct_links = {}
for link in bsobj.find_all('a'):
    href = link.get("href")
    # Keep only links that point to Choose Your Words articles
    if href and "chooseyourwords" in href:
        # urljoin avoids a double slash when href already starts with "/"
        distinct_links[href] = urljoin(domain_href, href)
print(distinct_links.keys())

After gathering the unique links, we need to disassemble each page and extract the article body.

def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    # The article content lives in a div with class "articlebody"
    return bsobj.find('div', class_="articlebody")

Put together, the script looks like this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_vocabulary_links():
    domain_href = 'https://www.vocabulary.com/'
    output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
    bsobj = BeautifulSoup(output.text, 'html.parser')
    distinct_links = {}
    for link in bsobj.find_all('a'):
        href = link.get("href")
        # Keep only links pointing to Choose Your Words articles
        if href and "chooseyourwords" in href:
            # urljoin avoids a double slash when href already starts with "/"
            distinct_links[href] = urljoin(domain_href, href)
    return distinct_links.values()

def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    return bsobj.find('div', class_="articlebody")


if __name__ == "__main__":
    link_to_vocabulary = {}
    links = get_vocabulary_links()
    for link in links:
        link_to_vocabulary[link] = disassemble_vocabulary_page(link)

Saving HTML data into a PDF

Let's use pdfkit to create the PDF. It relies on wkhtmltopdf under the hood, so first we need to install that:

sudo apt-get install wkhtmltopdf

Then we can run pipenv install pdfkit to make the pdfkit library available in the environment.
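If wkhtmltopdf ends up outside your PATH, pdfkit also lets you point at the binary explicitly. A minimal sketch, assuming the default apt-get install location /usr/bin/wkhtmltopdf (your path may differ):

import pdfkit

# Point pdfkit at the wkhtmltopdf binary explicitly; this path is an
# assumption for a default apt-get install on Debian/Ubuntu.
config = pdfkit.configuration(wkhtmltopdf='/usr/bin/wkhtmltopdf')
pdfkit.from_string('<h1>Hello</h1>', 'hello.pdf', configuration=config)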

We can convert the HTML we have scraped with this function:

def html_to_pdf(link, data):
    # Derive the file name from the article slug; strip slashes so it is a valid file name
    file_name = link.split("chooseyourwords")[-1].replace("/", "")
    pdfkit.from_string(data, "{}.pdf".format(file_name))
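wkhtmltopdf prints progress output to the console for every conversion; if that gets noisy when converting many pages, pdfkit can pass wkhtmltopdf flags through an options dict. A small sketch:

import pdfkit

# Pass wkhtmltopdf flags via the options dict; 'quiet' suppresses
# the per-page progress output during conversion.
pdfkit.from_string('<p>example</p>', 'example.pdf', options={'quiet': ''})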

Source code

The output is not the prettiest yet, but at least we have the data in a PDF (see the sketch after the script for one possible improvement).

import requests
import pdfkit
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_vocabulary_links():
    domain_href = 'https://www.vocabulary.com/'
    output = requests.get('{}articles/chooseyourwords/'.format(domain_href))
    bsobj = BeautifulSoup(output.text, 'html.parser')
    distinct_links = {}
    for link in bsobj.find_all('a'):
        href = link.get("href")
        # Keep only links pointing to Choose Your Words articles
        if href and "chooseyourwords" in href:
            # urljoin avoids a double slash when href starts with "/"
            distinct_links[href] = urljoin(domain_href, href)
    return distinct_links.values()

def disassemble_vocabulary_page(link):
    output = requests.get(link)
    bsobj = BeautifulSoup(output.text, 'html.parser')
    return bsobj.find('div', class_="articlebody")

def html_to_pdf(link, data):
    file_name = link.split("chooseyourwords")[-1].replace("/", "")
    pdfkit.from_string(data, "{}.pdf".format(file_name))

if __name__ == "__main__":
    links = get_vocabulary_links()
    for link in links:
        article = disassemble_vocabulary_page(link)
        # Skip pages where no article body was found (e.g. the section index itself)
        if article is not None:
            html_to_pdf(link, article.text)
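To make the output a bit nicer, one option (a sketch under my own assumptions, not the version used above) is to keep the article's HTML markup instead of the stripped text, wrapped in a minimal page with a UTF-8 charset declaration so headings and special characters survive the conversion. The helper name html_to_pretty_pdf is hypothetical:

def html_to_pretty_pdf(link, article_tag):
    # Keep the original markup (str(tag)) instead of plain .text, and
    # declare UTF-8 so non-ASCII characters render correctly.
    html = '<html><head><meta charset="utf-8"></head><body>{}</body></html>'.format(
        str(article_tag))
    file_name = link.split("chooseyourwords")[-1].replace("/", "")
    pdfkit.from_string(html, "{}.pdf".format(file_name))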


Thanks!

That's it :) Comment, share or don't - up to you.

Any suggestions on what I should blog about? Drop me a comment in the box below or poke me on Twitter: @anselmos88.

See you in the next episode! Cheers!


