Web-Crawler is an open source system for managing and extracting data across multiple host sites, providing basic mechanisms for file management, logging, and compression of application data.


web.project.web_crawler module

A web crawler designed to find products, and their ratings, that are developed for and targeted at seniors.

class project.web_crawler.WebCrawler(url=None, about=None, sub_url=None, page=None, data=None, clean=False)[source]

Bases: object

Web Crawler
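
Example (a minimal instantiation sketch; the URL shown here is a placeholder, not part of the project):
>>> from project.web_crawler import WebCrawler
>>> crawler = WebCrawler(url="https://www.example.com")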

cleanup()[source]

Cleans up CSV files in the current directory and saves them to the csv folder. Returns:

self.clean: bool - file cleaned.
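Example (a usage sketch, assuming an existing WebCrawler instance named crawler):
>>> cleaned = crawler.cleanup()
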
compress()[source]

Compresses files received from the web crawler.
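
Example (a usage sketch, assuming crawler is an existing WebCrawler instance that has already produced output files):
>>> crawler.compress()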

csv_to_database()[source]

Exports extracted CSV data to an SQL database. Returns:

self.clean: bool - file cleaned.
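Example (a usage sketch, assuming crawler has already extracted CSV files):
>>> loaded = crawler.csv_to_database()
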
data_extract()[source]

Extracts the URL page data and parses the information with BeautifulSoup.
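
Example (a usage sketch, assuming crawler was constructed with a valid url; the call order shown is an assumption):
>>> crawler.data_extract()
>>> page_data = crawler.get_data()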

get_data()[source]

Get the data that the webcrawler is parsing. Returns:

self.data: string - page data.
Example:
>>> example_data = crawler.get_data()
												
get_description()[source]

Get the description of the product located within the targeted web page. Returns:

self.about: string - description of product.
Example:
>>> example_description = crawler.get_description()
												
get_nav_categories()[source]

Get the categories parsed within the webcrawler. Returns:

self.categories: list - list of categories within the navigation bar.
Example:
>>> example_categories = crawler.get_nav_categories()
												
get_nav_catlinks()[source]

Get the category links within the webcrawler. Returns:

self.catlinks: list - list of category links within the navigation bar.
Example:
>>> example_catlinks = crawler.get_nav_catlinks()
												
get_page()[source]

Gets the page that the webcrawler is parsing data from. Returns:

self.page: string - the page of the url.
Example:
>>> example_page = crawler.get_page()
												
get_sub_url()[source]

Gets the sub-URL that the webcrawler will be accessing. Returns:

sub_url: string - the sub-URL.
Example:
>>> example_sub_url = crawler.get_sub_url()
												
get_url()[source]

Gets the url that the webcrawler will be accessing. Returns:

url: string - the url.
Example:
>>> example_url = crawler.get_url()
												
log_cleanup()[source]

Cleans up log files in the current directory and saves them to the log folder. Returns:

self.clean: bool - file cleaned.
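Example (a usage sketch, assuming log files exist in the current directory):
>>> cleaned = crawler.log_cleanup()
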
open_log()[source]

Creates a log file for the web crawler.
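
Example (a usage sketch; crawler is an existing WebCrawler instance):
>>> crawler.open_log()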

sub_data_extract()[source]

Extracts the sub-URL page data and parses the information with BeautifulSoup.
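
Example (a usage sketch, assuming crawler was constructed with a sub_url; the call shown is an assumption about typical use):
>>> crawler.sub_data_extract()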