30 built-in Python modules you should be using

One feature of Python that has made it so popular is the sheer number and variety of tools that make a developer’s life easier. Modules and libraries help to speed up the development process, saving your team plenty of time to focus on the business logic or other key aspects of your project. If you’re writing anything longer than a few lines of Python, prepare your input in a text editor rather than typing it straight into the interpreter - quitting the interpreter means that all your definitions will be lost. As your code grows longer, you’ll want to split it into several files for easier maintenance. That way, you get to use a specific function in other programs without having to copy its definition into each of them. Python allows developers to place definitions in a file and use them in a script or in an interactive instance of the interpreter. That kind of file is called a module: a file that contains Python definitions and statements. Here are 30 built-in Python modules you should be using in your Python projects.

1. abc

This tiny module delivers the environment you need for creating abstract base classes. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from abc import ABCMeta, abstractmethod >>> class MyAbstractClass(metaclass=ABCMeta): ...     @abstractmethod ...     def x(self): ...         pass ...     @abstractmethod ...     def y(self): ...         pass ... >>> obj = MyAbstractClass() Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: Can't instantiate abstract class MyAbstractClass with abstract methods x, y >>> class MyCustomClass(MyAbstractClass): ...     def x(self): ...         return 'x' ... >>> obj = MyCustomClass() Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: Can't instantiate abstract class MyCustomClass with abstract methods y >>> class MyCustomClass(MyAbstractClass): ...     def x(self): ...         return 'x' ...     def y(self): ...         return 'y' ... >>> obj = MyCustomClass() >>> obj.x(), obj.y() ('x', 'y') [/cc]

2. argparse

This module contains tools that make it easier to create user-friendly interfaces from the level of the command line. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import argparse >>> parser = argparse.ArgumentParser() >>> parser.exit = lambda x='', y='': print(x, y)  # We don't want to exit python shell >>> _ = parser.add_argument('foo', help='help text for `foo`') >>> _ = parser.add_argument('-b', '--bar', help='help text for `--bar`') >>> parser.parse_args(['-h']) usage: [-h] [-b BAR] foo foo positional arguments: foo                help text for `foo` foo                help text for `foo` optional arguments: -h, --help         show this help message and exit -b BAR, --bar BAR  help text for `--bar` usage: [-h] [-b BAR] foo foo 2 : error: the following arguments are required: foo, foo Namespace(bar=None, foo=None) >>> parser.parse_args() usage: [-h] [-b BAR] foo foo 2 : error: the following arguments are required: foo, foo Namespace(bar=None, foo=None) >>> parser.parse_args('example.py foo_data'.split(' ')) Namespace(bar=None, foo='foo_data') >>> parser.parse_args('example.py foo_data --bar bar_data'.split(' ')) Namespace(bar='bar_data', foo='foo_data') >>> result = parser.parse_args('example.py foo_data --bar bar_data'.split(' ')) >>> result.foo 'foo_data' >>> result.bar 'bar_data' [/cc]

3. asyncio

This is a very large module that delivers the framework and environment for asynchronous programming. It was added to Python 3.4 as a provisional module and works as an alternative to the popular Twisted framework. In short, asyncio comes in handy if you want to create asynchronous, concurrent, and single-threaded code. The module launches a loop where the asynchronous code is executed in the form of tasks. When one task is inactive (for example, waiting for a server response), the module executes another task. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import asyncio >>> async def printer(msg, sleep): ...     await asyncio.sleep(sleep) ...     print (msg) ... >>> task0 = asyncio.ensure_future(printer('task0 -> sleep 5', 5)) >>> task1 = asyncio.ensure_future(printer('task1 -> sleep 1', 1)) >>> task2 = asyncio.ensure_future(printer('task2 -> sleep 3', 3)) >>> future = asyncio.gather(task0, task1, task2) >>> future.add_done_callback(lambda x: asyncio.get_event_loop().stop()) >>> loop = asyncio.get_event_loop() >>> loop.run_forever() task1 -> sleep 1 task2 -> sleep 3 task0 -> sleep 5 >>> future.done() True [/cc]

4. base64

This well-known module delivers a tool for coding and decoding binary code into a format that can be displayed - and the other way round. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import base64 >>> data = base64.b64encode(b'ASCII charakters to be encoded') >>> data b'QVNDSUkgY2hhcmFrdGVycyB0byBiZSBlbmNvZGVk' >>> data = base64.b64decode(data) >>> data b'ASCII charakters to be encoded' [/cc]

5. collections

This module offers specialized container data types that work as alternatives to the general-purpose built-in containers: dict, list, set, and tuple. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from collections import defaultdict >>> data, ddata = dict(), defaultdict(int) >>> data, ddata ({}, defaultdict(<class 'int'>, {})) >>> ddata['key'] += 5 >>> data['key'] += 5 Traceback (most recent call last): File "<stdin>", line 1, in <module> KeyError: 'key' >>> data, ddata ({}, defaultdict(<class 'int'>, {'key': 5})) [/cc]

6. copy

Everyone uses this tiny module that contains tools for deep copying of container type data. Its most famous function is `deepcopy` - if not for this function, copying lists and dictionaries would become a torture for developers. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from copy import deepcopy >>> data = {'key0': {'key1': False}} >>> cdata = data.copy() >>> data, cdata ({'key0': {'key1': False}}, {'key0': {'key1': False}}) >>> cdata['key0']['key1'] = True >>> data, cdata ({'key0': {'key1': True}}, {'key0': {'key1': True}}) >>> data = {'key0': {'key1': False}} >>> dcdata = deepcopy(data) >>> data, dcdata ({'key0': {'key1': False}}, {'key0': {'key1': False}}) >>> dcdata['key0']['key1'] = True >>> data, dcdata ({'key0': {'key1': False}}, {'key0': {'key1': True}}) [/cc]

7. csv

Delivers functionality for exporting and importing tabular data in CSV format. The module lets developers load data from and save data to CSV files. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import csv >>> with open('example.csv', 'w', newline='') as csvfile: ...     writer = csv.writer(csvfile) ...     writer.writerow(['Column A', 'Column B', 'Column C']) ...     writer.writerow(['Cell 0 A', 'Cell 0 B', 'Cell 0 C']) ...     writer.writerow(['Cell 1 A', 'Cell 1 B', 'Cell 1 C']) ... 28 28 28 >>> with open('example.csv', newline='') as csvfile: ...     reader = csv.reader(csvfile) ...     print('\n'.join([' | '.join(row) for row in reader])) ... Column A | Column B | Column C Cell 0 A | Cell 0 B | Cell 0 C Cell 1 A | Cell 1 B | Cell 1 C [/cc]

8. datetime

A simple and one of the most popular modules in Python. It delivers tools that make it easier to work with dates and times. The most popular classes are  `datetime`, `timezone` and `timedelta`, but `date`,`time`, and `tzinfo` can be useful as well. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from datetime import datetime, timedelta, timezone >>> obj = datetime.now() >>> obj datetime.datetime(2018, 10, 7, 12, 27, 25, 527961) >>> obj += timedelta(days=7) >>> obj datetime.datetime(2018, 10, 14, 12, 27, 25, 527961) >>> obj.astimezone(timezone.utc) datetime.datetime(2018, 10, 14, 10, 27, 25, 527961, tzinfo=datetime.timezone.utc) [/cc]

9. decimal

The module delivers a data type called Decimal. Its main advantage is correct rounding of decimal numbers which is extremely important in billing systems - even a slight distortion with several hundred thousand operations can change the final result significantly. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import decimal >>> float(0.1 + 0.2) 0.30000000000000004 >>> float(decimal.Decimal('0.1') + decimal.Decimal('0.2')) 0.3 [/cc]
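To show the rounding behavior the paragraph above refers to, here is a small, hypothetical billing-style snippet (the amounts and tax rate are made up); quantize lets you fix the number of decimal places and the rounding rule explicitly:

[cc lang="python" escaped="true" lines="100"]
from decimal import Decimal, ROUND_HALF_UP

net = Decimal("19.99") * 3            # Decimal('59.97')
gross = net * Decimal("1.23")         # add 23% tax -> Decimal('73.7631')

# Round the final amount to two decimal places with an explicit rounding rule.
print(gross.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 73.76
[/cc]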

10. functools

The functools module comes in handy for higher-order functions, ie. the functions that act on or return other functions. You can treat any callable object as a function for the purposes of this module. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from functools import wraps >>> def example_decorator(func): ...     @wraps(func) ...     def wrapper(*args, **kwargs): ...         print("Print from decorator") ...         return func(*args, **kwargs) ...     return wrapper ... >>> @example_decorator ... def example_func(): ...     """Func docstring""" ...     print("Print from func") ... >>> example_func() Print from decorator Print from func >>> example_func.__name__ 'example_func' >>> example_func.__doc__ 'Func docstring' >>> from functools import lru_cache >>> @lru_cache() ... def example_func(n): ...     return n * n ... >>> [example_func(n) for n in range(10)] [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] >>> example_func.cache_info() CacheInfo(hits=0, misses=10, maxsize=128, currsize=10) >>> [example_func(n) for n in range(10)] [0, 1, 4, 9, 16, 25, 36, 49, 64, 81] >>> example_func.cache_info() CacheInfo(hits=10, misses=10, maxsize=128, currsize=10) [/cc]

11. hashlib

This handy module implements a common interface to numerous secure hash and message digest algorithms like the FIPS secure hash algorithms SHA1, SHA224, SHA256, SHA384, and SHA512 (defined in FIPS 180-2), as well as RSA’s MD5 algorithm (defined in Internet RFC 1321). Note the terminology: “secure hash” and “message digest” are interchangeable. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import hashlib >>> obj = hashlib.sha256() >>> obj.update(b"Example text for hash") >>> obj.digest() b'\xc0\xf4\xad\x1a\xf5IK\x9a\x17x\xf6"\xfagG\x92\xe1\x0c\x8f\xef\xe2\x99\xfa\x97\x12Q#\xe7Za\xb3\xe2' >>> obj.hexdigest() 'c0f4ad1af5494b9a1778f622fa674792e10c8fefe299fa97125123e75a61b3e2' [/cc]

12. http

This package collects several modules for working with the HyperText Transfer Protocol such as http.client (low-level HTTP protocol client), http.server (includes basic HTTP server classes based on socketserver), http.cookies (with utilities for implementing state management with cookies) and http.cookiejar (that provides persistence of cookies). Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import http.client, http.server, http.cookies, http.cookiejar >>> conn = http.client.HTTPSConnection("sunscrapers.com") >>> conn.request("GET", "/") >>> resp = conn.getresponse() >>> resp.status 200 >>> resp.readlines()[2:5] [b'<!DOCTYPE HTML>\n', b'<html>\n', b'<head>\n'] >>> conn.close() >>> http.HTTPStatus <enum 'HTTPStatus'> >>> http.client <module 'http.client' from '/usr/lib/python3.7/http/client.py'> >>> http.server <module 'http.server' from '/usr/lib/python3.7/http/server.py'> >>> http.cookies <module 'http.cookies' from '/usr/lib/python3.7/http/cookies.py'> >>> http.cookiejar <module 'http.cookiejar' from '/usr/lib/python3.7/http/cookiejar.py'> [/cc]

13. importlib

The package provides the implementation of the import statement (and the __import__() function) in Python source code. The components to implement import are exposed, making it easier for developers to create their own custom objects to participate in the import process. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import importlib >>> importlib.sys <module 'sys' (built-in)> >>> importlib.import_module("this") The Zen of Python, by Tim Peters ... <module 'this' from '/usr/lib/python3.7/this.py'> [/cc]

14. itertools

This useful module implements iterator building blocks inspired by constructs from APL, Haskell, and SML (all in a form that matches Python programs). It standardizes a core collection of quick, memory-efficient tools you can use on their own or in combination to construct specialized tools efficiently in pure Python. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import itertools >>> "-".join(list(itertools.chain("ABC", "DEF"))) 'A-B-C-D-E-F' >>> list(itertools.combinations("ABC", 2)) [('A', 'B'), ('A', 'C'), ('B', 'C')] [/cc]

15. inspect

The module offers several useful functions that help developers get information about live objects such as modules, classes, methods, functions, tracebacks, frame objects, and code objects in Python code. It provides four main kinds of services: type checking, getting source code, inspecting classes and functions, and examining the interpreter stack. You can use it to retrieve the source code of a method, examine the contents of a class, or just get all the relevant information to display a detailed traceback. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import inspect >>> inspect.getmodulename(inspect.__file__) 'inspect' >>> inspect.isfunction(lambda _: _) True >>> inspect.signature(lambda _: _) <Signature (_)> [/cc]

16. json

JSON (JavaScript Object Notation) is a lightweight data interchange format that was inspired by JavaScript object literal syntax. The module json exposes an API that looks similar to the standard library marshal and pickle modules. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import json >>> data = json.dumps({"A": 1, "B": True, "C": [1, 2, 3]}) >>> data '{"A": 1, "B": true, "C": [1, 2, 3]}' >>> json.loads(data) {'A': 1, 'B': True, 'C': [1, 2, 3]} [/cc]

17. logging

This module defines functions and classes that provide a flexible event logging system for applications and libraries. It’s a good idea to use it because it ensures that all Python modules can participate in logging - your application log may include your messages integrated together with messages coming from third-party modules. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import logging >>> FORMAT = '[%(asctime)-15s] [%(name)s] [%(levelname)s] - %(message)s' >>> logging.basicConfig(format=FORMAT) >>> logger = logging.getLogger("MY_LOGGER") >>> logger.error("My error message") [2018-10-29 09:22:10,880] [MY_LOGGER] [ERROR] - My error message [/cc]

18. math

This module gives developers access to the mathematical functions defined by the C standard. These functions can’t be used with complex numbers, which is a good thing if you don’t want to learn the extra mathematics required to understand them (the cmath module covers complex numbers if you need them). Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import math >>> math.pow(2, 4) 16.0 >>> math.sqrt(16) 4.0 >>> math.floor(5/2) 2 >>> math.ceil(5/2) 3[/cc]

19. multiprocessing

This handy package supports spawning processes with the help of an API similar to the threading module. It provides both local and remote concurrency using subprocesses instead of threads to side-step the Global Interpreter Lock. Developers use it to take full advantage of multiple processors on a given machine. Runs on both Unix and Windows. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from multiprocessing import Process >>> import time >>> def my_function(): ...     count = 0 ...     while True: ...         print ("My function: {0}".format(count)) ...         count += 1 ...         time.sleep(1) ... >>> proc = Process(target=my_function, args=()) >>> proc.start() >>> My function: 0 My function: 1 My function: 2 ... [/cc]

20. os

This module offers a portable method of using operating system dependent functionality. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import os >>> os.getpid() 7171 >>> os.name 'posix' >>> os.environ['_'] '/usr/bin/python3.7' >>> os.path.join('dir0', 'dir1', 'file.py') 'dir0/dir1/file.py' [/cc]

21. pdb

This module defines an interactive source code debugger. It supports single stepping at the source line level, setting (conditional) breakpoints, inspection of stack frames, source code listing, and more. Example: [cc lang="JavaScript" escaped="true" lines="100"] slawek@swiegy:/tmp$ echo "print('Hello World')" > test.py slawek@swiegy:/tmp$ python test.py Hello World slawek@swiegy:/tmp$ echo "import pdb;pdb.set_trace()" > test.py slawek@swiegy:/tmp$ python test.py --Return-- > /tmp/test.py(1)<module>()->None -> import pdb;pdb.set_trace() (Pdb) [/cc]

22. random

This useful module implements pseudo-random number generators for various distributions. For example, you get uniform selection from a range for integers and for sequences, there is uniform selection of a random element (as well as a function to generate a random permutation of a list in-place and for random sampling without replacement). Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import random >>> random.randint(0,100) 20 >>> random.randint(0,100) 42 >>> random.random() 0.6609538830840114 >>> random.random() 0.486448371691216 [/cc]

23. re

This module provides regular expression matching operations similar to those you get in Perl. You can search Unicode strings and 8-bit strings, but they can’t be mixed. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import re >>> res = re.search('(?<=abc)def', 'abcdef') >>> res.group(0) 'def' >>> bool(res) True >>> res = re.search('(?<=abc)def', 'xyzdef') >>> bool(res) False [/cc]

24. shutil

The module provides a number of high-level operations on files and collections of files, especially functions that support file copying and removal. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import shutil >>> import os >>> os.listdir() [] >>> open('test.py', 'a+').close() >>> os.listdir() ['test.py'] >>> shutil.copyfile('test.py', 'test2.py') 'test2.py' >>> os.listdir() ['test2.py', 'test.py'] [/cc]

25. sys

The module offers access to variables used or maintained by the interpreter and functions that interact strongly with the interpreter. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import sys >>> sys.argv [''] >>> sys.modules.keys() dict_keys(['sys', 'builtins', '_frozen_importlib', '_imp', '_thread', '_warnings', '_weakref', 'zipimport', '_frozen_importlib_external', '_io', 'marshal', 'posix', 'encodings', 'codecs', '_codecs', 'encodings.aliases', 'encodings.utf_8', '_signal', '__main__', 'encodings.latin_1', 'io', 'abc', '_abc', 'site', 'os', 'stat', '_stat', 'posixpath', 'genericpath', 'os.path', '_collections_abc', '_sitebuiltins', '_bootlocale', '_locale', 'types', 'importlib', 'importlib._bootstrap', 'importlib._bootstrap_external', 'warnings', 'importlib.util', 'importlib.abc', 'importlib.machinery', 'contextlib', 'collections', 'operator', '_operator', 'keyword', 'heapq', '_heapq', 'itertools', 'reprlib', '_collections', 'functools', '_functools', 'zope', 'sitecustomize', 'apport_python_hook', 'readline', 'atexit', 'rlcompleter', 'shutil', 'fnmatch', 're', 'enum', 'sre_compile', '_sre', 'sre_parse', 'sre_constants', 'copyreg', 'errno', 'zlib', 'bz2', '_compression', 'threading', 'time', 'traceback', 'linecache', 'tokenize', 'token', '_weakrefset', '_bz2', 'lzma', '_lzma', 'pwd', 'grp']) >>> sys.path ['', '/usr/lib/python37.zip', '/usr/lib/python3.7', '/usr/lib/python3.7/lib-dynload', '/usr/local/lib/python3.7/dist-packages', '/usr/lib/python3/dist-packages'] [/cc]

26. threading

This useful module builds higher-level threading interfaces on top of the lower level _thread module. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import threading >>> def worker(num): ...     print ("Worker num: {0}".format(num)) ... >>> for i in range(4): ...     threading.Thread(target=worker, args=(i,)).start() ... Worker num: 0 Worker num: 1 Worker num: 2 Worker num: 3 [/cc]

27. types

The module defines utility functions that support the dynamic creation of new types. It defines names for object types used by the standard Python interpreter but not exposed as builtins. It also offers additional type-related utility classes and functions that aren’t builtins. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import types >>> def gen(): ...     yield True ... >>> isinstance(gen(), types.GeneratorType) True >>> import logging >>> logging <module 'logging' from '/usr/lib/python3.7/logging/__init__.py'> >>> types.ModuleType("logging") <module 'logging'>[/cc]

28. unittest

Originally inspired by JUnit, the module works similarly to major unit testing frameworks in other programming languages. It supports a broad range of features: test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and more. Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import unittest >>> class TestStringMethods(unittest.TestCase): ...     def test_example(self): ...         self.assertTrue(True, True) ... >>> unittest.main() . ---------------------------------------------------------------------- Ran 1 test in 0.000s OK[/cc]

29. urllib

This handy package collects several modules for working with URLs: urllib.request (opening and reading URLs), urllib.error (includes exceptions raised by urllib.request), urllib.parse (for parsing URLs), and urllib.robotparser (for parsing robots.txt files). Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> from urllib.parse import urlparse >>> urlparse('https://sunscrapers.com/aboutus/') ParseResult(scheme='https', netloc='sunscrapers.com', path='/aboutus/', params='', query='', fragment='')[/cc]

30. uuid

The module provides immutable UUID objects, as well as the following functions for generating version 1, 3, 4, and 5 UUIDs: uuid1(), uuid3(), uuid4(), uuid5(). Example: [cc lang="JavaScript" escaped="true" lines="100"] >>> import uuid >>> uuid.UUID('{ab4c0103-e305-0668-060a-0a3af12d3a0d}') UUID('ab4c0103-e305-0668-060a-0a3af12d3a0d') >>> uuid.uuid4() UUID('76e1708f-1f12-4034-87cc-7c82f0591c61')[/cc] I hope this Python modules list helps you make the most of Python in your future projects. Naturally, this list is just the tip of the iceberg. For more information about Python built-in modules, head over to the extensive Python documentation. Do you have any questions about Python modules? Give us a shout out in the comments; we’re happy to share our knowledge of this incredibly versatile programming language.

Data warehouses – what they are and how to classify them (Part 3)

This is the third and last post in our series about data warehouses. If you missed the previous ones, check them out here:
  • http://sunscrapers.com/blog/data-warehouses-classify-part-1/
  • http://sunscrapers.com/blog/data-warehouses-what-they-are-and-how-to-classify-them-part-2/
In our series, we have shown you:
  • What data warehouses are
  • Who needs a data warehouse
  • Tips about implementing and designing data warehouses
But how do you make sure that your data warehouse is safe? Here’s the answer.

Security in data warehouses

They say there are two types of people dealing with databases: those who do backups and those who will do them. This well-known phrase is still circulating because it's actually quite accurate. One of the most significant aspects of data warehouses that often gets overlooked or downplayed is the management of historical data and backups. In this part of our series about data warehouses, I wanted to share some security essentials. But before I explain how to take care of security in data warehouses, there's one thing we need to make clear first: never delete anything permanently. Just don't. In theory, everyone who has some experience with Big Data knows that, yet in practice you may still come across cases where reconstructing the current state of the database or analyzing how the data was created turns out to be impossible.

Data warehousing disasters

One of the most common scenarios that can lead to a disaster is a loss of some or all of the data. That might happen for many different reasons, from a disk array failure to poorly designed functionality or an error made by a developer. It may also happen that the data in our warehouse loses coherence. Recovering data integrity without any historical data, a changelog, or at least a backup will be challenging – in some cases it might not even be possible! The data we store in our warehouse is going to be constantly updated. In time, we might find ourselves in need of historical data that no longer exists. When creating a data warehouse, remember that the data stored there may be used in the future for purposes other than those we planned initially. It may happen that we want to generate analytical data showing the changes that took place over several months - if we don’t record information about such changes, this won’t be possible. Consider the following examples: we have a website that manages sports bets and we would like to check how users behave in relation to the information published about a team or player. Another example is where we need to check the details of invoices issued during a given month, or when it turns out that something isn’t right in our quarterly settlement. It might also happen that we need information about users who logged into our system to verify and resolve any ambiguities arising from their activity.

So how do we protect our data warehouses?

There are basically two good solutions you can choose from when ensuring the safety of data stored in your data warehouse.

Regular backups (dumps)

Backup copies are something that should be part of the definition of a data warehouse. Depending on how often our database is updated, we should make automatic backups. It's common to perform a database dump once a day and keep it for several months, but in some situations we may need to perform it more often and keep backups for a more extended period. There are many tools available for creating and storing database dumps – most of them offer compression features that help to save space on the hard drive.

Changelog

In addition to regular backups, it's a good idea to keep a record of changes for individual atomic pieces of information. That allows making corrections to our system and then reproducing the current state of the database with those changes taken into account. The downside of this solution is the need to store quite a large amount of redundant data. Also, the implementation of such a mechanism is not easy, so before we decide to create a changelog, we should ask ourselves whether we really need it.

I hope this guide to data warehousing helps you to make the right decision regarding the methods for storing data at your organization. Have you got any questions about data warehouses? Give us a shout out in the comments; we're looking forward to hearing from you. Got a data science project on your mind? Hire us to get top expertise in the field.

Data warehouses – what they are and how to classify them (Part 2)

Here's the second post in our series about data warehouses. If you missed the previous one, check it out here. http://devblog.sunscrapers.com/data-warehouses-classify-part-1/ In our series, we'll show you:
  • What data warehouses are + their short history
  • Who needs a data warehouse
  • Key components of data warehouses
  • Tips about implementing and designing data warehouses
  • How to make sure that your data warehouse is secure
  Ready to learn more about data warehouses and how to implement them? Let's get started.

Designing data warehouses

Structured vs. semi-structured vs. unstructured

Another important issue we should consider when designing our data warehouse is whether the data we plan to store will be highly structured, partially structured, or unstructured. In general, each of these data types requires a different kind of data warehouse:

Structured

Data in a structured data warehouse will always assume the same form, and the data structure will rarely change. Relational database systems are usually used as such warehouses; they facilitate the storage of large amounts of data in a way that allows finding the data we need quickly. The advantage of structured databases is their efficiency in providing us with already pre-transformed data. Downsides? Extending and changing the data structure is quite costly and risky.

Semi-structured

Data warehouses of this type usually contain data that is grouped, categorized, and segregated, but the structure of the target data may assume different forms. In the past, two types of databases – relational and non-relational – were used for this type of solution. The relational database mostly held an index of the data along with information about where it's located, while the non-relational database held the target data in an unstructured form. Currently, database systems such as MySQL and PostgreSQL provide functionality for storing unstructured data in a single table cell in a way that makes accessing it similar to the rest of the structured data (adding a new data type called JSON was the solution).

Unstructured

This type of data warehouse contains entirely unstructured data, and one of its applications is ELT systems. To explain this type of warehouse, it's worth briefly describing the difference between the ETL and ELT processes first. The ETL process involves downloading data, transforming it, and saving it to the database (read more about it in this article), while the ELT process consists of downloading data, saving it to the database, and transforming it in real time when it's retrieved. In the case of the ELT process, our data warehouse will get pre-unified but ultimately unprocessed data, so the format and structure of the same data may differ depending on the source it comes from.

This type of division is quite significant – perhaps not so much from the business point of view, but rather from the technical perspective. The choice between a structured, semi-structured, and unstructured data warehouse has a massive impact on the kind of technological solutions we choose for implementing our data warehouse.
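To make the point about JSON columns more concrete, here is a minimal sketch of keeping semi-structured data next to ordinary relational columns. It assumes a PostgreSQL database and the psycopg2 driver; the connection details, table, and payload are made up for the example:

[cc lang="python" escaped="true" lines="100"]
import psycopg2
from psycopg2.extras import Json

# Hypothetical connection details.
conn = psycopg2.connect("dbname=warehouse user=etl")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id      serial PRIMARY KEY,
        source  text NOT NULL,
        payload jsonb  -- target data whose shape may differ per source
    )
""")
cur.execute(
    "INSERT INTO events (source, payload) VALUES (%s, %s)",
    ("webshop", Json({"user": 42, "items": [{"sku": "A1", "qty": 2}]})),
)
conn.commit()

# The JSONB column can then be queried much like the structured columns.
cur.execute("SELECT payload->>'user' FROM events WHERE source = %s", ("webshop",))
print(cur.fetchone())
conn.close()
[/cc]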

Here are some examples of real-world applications:

Structured

A database that contains data about employees, finances, and infrastructure resources is usually structured. Such data warehouses supply HR, logistics, and inventory systems. You'll also find them in systems used in banking, healthcare, or telecommunications; that is, in systems that require data reliability and consistency.

Semi-structured

All databases that contain incomplete data, analytical and statistical data, or temporary data. You'll find this type of data warehouse in systems based on Machine Learning, used in Business Intelligence processes or, more generally, related to the field of Data Science (data mining, visualization, data processing). These are usually databases that result from a migration and data processing process, and they're often just temporary. It can also be an ODS data warehouse.

Unstructured

These are usually crawling databases that contain data from many different sources in their original or pre-unified/processed form. Unstructured databases are also created during the ELT process at the Load (L) stage.

Physical vs. virtual data warehouses

Looking at the problem of cataloging our data, we can also review solutions such as iPaaS, Data Virtualization, or Data Federation. In that case we need to distinguish between physical and virtual data warehouses.

Physical

That's a data warehouse with physically stored data. On the one hand, this solution ensures the durability and consistency of data. On the other hand, it also requires more effort to maintain and develop.

Virtual

Data in this type of warehouse is metadata, including information about the structure and location of the physically stored data. These types of data warehouses are created when, instead of building an ETL- or ELT-based system, we pick Data Virtualization or Data Federation solutions.

Build a data warehouse in accordance with its goal

When deciding to build a data warehouse, remember that a data warehouse is not an end in itself – instead, it's a means to achieving a goal. That's why it's worth considering what data we will use in our warehouse – and how. Looking at the problem from this perspective, we can divide data warehouses further into:

Subject-Oriented

All data in our warehouse is grouped in relation to the subjects of the enterprise – the entities existing in our organization. If we store the costs of employing our staff there, the database will be designed so that the data can easily be saved and read as "HR." The issues related to human resources form a separate topic in every company – most of the time, organizations have dedicated departments that deal with HR, and the same will be reflected in the database.

Integrated

Data in integrated data warehouses is unified, and the database itself is quite standardized. To illustrate that better, let's focus on one example: employee salary. In a Subject-Oriented warehouse, that information can be seen under "HR" as "employee -> remuneration," while under "Finance" the same data can be seen as "fixed expenses -> employee." In an integrated warehouse, however, this data will always be visible as "Finance -> Expenses -> Employees -> Employee."

Time-variant

In this database, data is sorted according to the chronology of events. This is usually historical data saved in such a way that we can easily retrieve and sort it after a while. This can be, for example, transaction data for micro-payments.

Nonvolatile

These are read-only data warehouses. That means we can only download data – and not create, update, or delete it.

Summarized

The data in the warehouse are saved and available as summary information, with the possibility of obtaining details. Here's an example: "Set of all employees -> costs of the software development department -> costs of one software developer." In that case, it's best to first ask about the total cost of employees and then go more in-depth. To sum up, data warehouses can be divided into many different kinds depending on the type and format of data we plan to store, as well as their application and purpose. There are still other ways to classify data warehouses, but the ones mentioned in this series are enough to approach the topic of data warehouse design and implementation successfully. Stay tuned for the third and last part of our series where I talk about different methods for keeping data warehouses secure.

Data warehouses – what they are and how to classify them (Part 1)

Curious about data warehouses? You've reached the right place.   In this article series, we'll show you:
  • What data warehouses are
  • Who needs a data warehouse
  • Tips about implementing and designing data warehouses
  • How to make sure that your data warehouse is safe
  Ready to explore the world of data warehousing? Let's dive right into the topic.

What are data warehouses?

The concept of a data warehouse (DW) is one of the oldest ideas for storing collected data – it dates back to the 1980s, when two IBM researchers built the first “business data warehouse.”

“Data warehouse” is a rather general concept; that way of storing data can be used in different places depending on specific needs. However, when we think about data warehouses, we often associate them with large companies and corporations – hence the term “Enterprise Data Warehouse” (EDW).

In the simplest terms, a data warehouse is a centralized database of integrated data, most often used as the central element of Business Intelligence solutions. To learn more about what makes BI solutions effective, have a look at this article from our blog. The data in a data warehouse is usually unified to a standard form and updated as regularly as possible. A data warehouse also stores historical data – that's something worth remembering when designing one.

Why do organizations implement such an architecture? The reason is simple: they rely on information derived from similarly-formatted, current and/or historical data through many different tools. Creating a data warehouse is a smart move because it gives companies a source of the most reliable and current information.

Who needs a data warehouse?

Data warehouses can be used in many different ways. But these are the people who are particularly appreciative of this solution:
  • Decision-makers who need data for analysis that helps in planning and decision-making.
  • People involved in the optimization of the production process of tangible or intangible goods.
  • Accountants and other finance professionals.
  • Logistics professionals and people responsible for inventory.
  • Staff involved in the monitoring of extensive systems, as well as reporting on and reacting to real-time events.
  • Those involved in the field of data science.
  • Professionals who have to deal with a lot of data from many different sources.
  Before we start to build a data warehouse, it's worth asking: do we really need it? Storing and managing large data sets can be challenging, and keeping data at the desired quality is a process that requires continuous effort. When implementing a data warehouse, we should always take costs into account – they may sometimes exceed the benefits of such a solution.

Classifying warehouses – first step in data warehouse design

When designing our data warehouse, we should think about how – and for what purpose – the data stored there will be used. On that basis, we can distinguish two (or rather, three) types of data warehouses:

Enterprise Data Warehouse (EDW)

An EDW is a centralized collection of unified, current, and often historical data of the enterprise. Its primary role is supporting business decision-making processes. The ETL process (or its newer ELT paradigm) is used to build this type of data warehouse. Read one of my previous articles to learn more about the ETL process. It's also worth noting that even if the data is current, it's not real-time – for that, we need the second type of data warehouse.

Operational Data Store

This type of data warehouse provides data in real time. Its most well-known application is reporting systems, but it can be used for many different purposes that require data refreshed in real time.

Data Mart

A Data Mart is usually a subset of data obtained from a central data warehouse that gathers targeted information, e.g., sales, employee, or financial data. These data sets can form their own mini data warehouses and get information from external sources, but they usually rely on information taken from their own databases and then transformed accordingly, depending on what it will be used for.

To help you understand this division of data warehouses, here's an example:

Example data warehousing scenario

Let's imagine that we're the owner of a large chain of grocery stores. To know what's happening in our company, we decide to build a centralized Enterprise Data Warehouse (EDW) that will be used by many different departments at our organization.

For this purpose, we decide to implement a system based on the ETL process that will download information from each store, process it, and then save it to our central database. To simplify that already complicated system, we add an Operational Data Store that contains data that has been unified, but not processed yet.

It's worth noting that in the ETL process, the most time-consuming stage is the processing of extracted data. Downloading and unification of data can be performed in real time – and then used in systems that report on the current status and alert us when there's an urgent need to intervene.

To use the data contained in the centralized database efficiently, each department, person, or service can create its own small data warehouse (Data Mart) storing data from our central database and external sources.

Want to learn more about data warehouses? Stay tuned for the next part of this article series where we continue talking about data warehouse implementations.

Here’s why Python is so popular in Machine Learning

Machine Learning (ML) is all the rage right now, and organizations that want to take advantage of this technology for their data often turn to Python. There are many reasons why Python is one of the most popular programming languages with developers and engineers who work on ML systems. Here are some good reasons why engineers choose Python for Machine Learning projects.

Simple syntax

First, there's Python's undeniable strength: simple and straightforward syntax. It's one of the most commonly cited reasons behind the popularity of the language in many other areas beyond ML. Note that the semantics of Python often correspond to mathematical ideas that are at the core of Machine Learning. That's why it's easier for engineers to express these ideas with the help of Python – and within relatively few lines of code. Since Python is a dynamically-typed language, it allows skipping a massive amount of work related to low-level tasks and going straight to the point. Developers won’t lose their nerve identifying and correcting their mistakes. It’s far more pleasant to read Python code than code written in Java, C++, or C#. Installing Python and preparing the environment for work is straightforward as well. Learning Python with the help of online resources is also much easier - the language is simply far more understandable.
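As a small illustration of how Python syntax mirrors the underlying math, here is a hypothetical snippet (not taken from any particular library) where the textbook formula for mean squared error translates almost one-to-one into code:

[cc lang="python" escaped="true" lines="100"]
# MSE = (1/n) * sum((y_i - p_i)^2), written almost exactly as in the textbook.
def mean_squared_error(y_true, y_pred):
    n = len(y_true)
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n

print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.8333...
[/cc]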

Gentle learning curve

Another significant advantage of Python is that it's easy to learn. That's another reason for its appeal among developers – today, it's the third most popular programming language, according to the TIOBE index. That means assembling a team of Python experts for your ML project will be easy. As Python enthusiasts, we agree with developers who claim that Python's accessible syntax makes it a much more welcoming and easy to use language than others. The important thing is that Python's simplicity doesn't mean we're trading off on performance. In fact, Python offers a very nice balance of the two, making it an excellent technology for complex Machine Learning projects. And Python’s gentle learning curve is a huge advantage to all those who are taking their first steps in data science. Instead of spending several months learning a new language, they can participate in the project immediately.

Scalability

In comparison to languages like R, Python is far more scalable - and way faster than Matlab or Stata. Python’s scalability comes from the flexibility that it offers to developers in problem-solving. The variety and breadth of Python applications indicate that the language can be used successfully for fast-growing projects.

Amazing community

Python is surrounded by a vibrant community of passionate developers who believe in knowledge sharing and have created plenty of resources for that purpose. For example, developers can join some of these 15 data science Slack communities to access productive discussions about Python in ML or ask questions when in doubt. Since so many people take advantage of Python, the support community is vast, and you can be sure that its collective knowledge will come to the rescue whenever your team encounters a problem.

Plenty of ML libraries

Most importantly, Python makes it easy for developers and engineers to start working on their projects by providing them with a collection of valuable tools for working with machine learning systems. The broad range of frameworks, libraries, and extensions makes implementing Machine Learning tasks easier. For example, scikit-learn guides developers in using Python for Machine Learning and Google's TensorFlow helps to build custom ML algorithms. Natural Language Processing (NLP) is another popular area where Python comes in handy - have a look at this article to see the best NLP Python libraries. Apart from these, developers can take advantage of core libraries for data structuring (Pandas, NumPy) and visualizations (Matplotlib, Plotly, Seaborn). And there’s also SciPy, a collection of libraries that are closely related to and sometimes even dependent on one another (SciPy, Scikit-Learn, NumPy, Pandas, Matplotlib). Here are a few libraries every ML enthusiast should know:
  • TensorFlow – This general-purpose library helps in building neural networks. Its main advantage here is the multi-layer node system that allows quick learning on large data sets. It’s a real speed demon! TensorFlow was created by Google, and its most famous applications are voice recognition and recognizing objects in pictures.
  • Keras – This high-level library is for deep learning. You can use Theano or TensorFlow as the backend - and even CNTK (Microsoft Cognitive Toolkit). Using Keras, you can easily build a neural network with only basic knowledge of the topic.
  • SciPy – SciPy is for carrying out mathematical operations on data matrices. It’s closely connected to NumPy and contains the main modules for linear algebra, statistics, Fourier transforms, as well as integration, optimization, and image processing.
  • Scikit-Learn – This library from the SciPy Stack is dedicated to machine learning and image processing. It sets the standard for machine learning in Python, combining ease of use, flexibility, high efficiency, and excellent documentation (and high-quality code!). A minimal example of its workflow follows below.
  • NumPy – One of the basic ML libraries in the SciPy Stack. It offers many handy functionalities related to operations on arrays and matrices, boosting their efficiency significantly.
  • Pandas – Another great library from the SciPy Stack, Pandas is used for carrying out operations on data sets such as adding/removing columns, filling in missing data, creating DataFrames from basic structures (lists, dictionaries), grouping, and aggregation. It helps to carry out complex operations easily and efficiently.
  • Matplotlib – Used for visualizing data sets, Matplotlib has many tools to draw various charts easily and quickly. It’s very useful for presenting results obtained using ML and for visualizing input data, which significantly helps to understand the problem we need to solve and how our models/algorithms work. It’s also part of the SciPy toolkit.
  • Scrapy – A library (or actually a framework) for web crawling that helps to easily obtain the data required for further processing. It was created by Scrapinghub, a company that has been professionally acquiring data from websites for many years.
Want to learn more? Check out this list for more amazing libraries.
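Here is the minimal scikit-learn example mentioned above - a sketch of the fit/predict workflow the library standardizes, using its bundled iris data set (assuming scikit-learn is installed):

[cc lang="python" escaped="true" lines="100"]
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy data set and split it into training and test parts.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train a model and evaluate it on the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
[/cc]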

Graphics and visualization

Another area where Python can help is visualization. The language offers a variety of visualization options. The visualization packages help developers get a better understanding of their data, create charts, and build web-ready interactive plots. To see this application in action, check out this post where Alex shows how to use a Python library called Matplotlib for visualization.
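For a taste of what that looks like in practice, here is a tiny, generic Matplotlib sketch (the data points are invented); it draws a simple line chart and saves it to a file:

[cc lang="python" escaped="true" lines="100"]
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [10, 12, 9, 15, 18, 21]  # made-up figures

plt.plot(months, revenue, marker="o")
plt.title("Revenue per month")
plt.xlabel("Month")
plt.ylabel("Revenue (k$)")
plt.savefig("revenue.png")  # or plt.show() in an interactive session
[/cc]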

Reducing complexity without compromises

Many other languages are used in Machine Learning - for example, Java, C, and Perl. Some developers describe these complex languages as responsible for “hard-coding,” whereas Python figures as a “toy language” that is more accessible to basic users. But in reality, Python is a fully functional alternative to these languages and their often complex syntax. Python is easy and accessible – and that makes collaborative coding and implementation so much easier. Let's face it: your Machine Learning project isn’t going to be developed by an individual, but by a group. And building a team of expert Python developers is much easier. As a general-purpose language, Python helps to get a lot done quickly – which brings a lot of value if we consider the general complexity of Machine Learning projects. Still not sure whether Python is the best tool for your Machine Learning project? Get in touch with us; we help companies pick the most promising technologies for their projects.

What the Extract, Transform, and Load process is and how to use it

If you've been watching the data science scene, you've probably spotted this term mentioned quite a lot: the Extract, Transform, and Load (ETL) process. ETL is a process used with databases, and especially in data warehousing. In particular, it’s about extracting data from multiple sources, processing it according to individual requirements, and storing it in databases. Organizations can use such data in many different ways: as databases for different systems, website content, or analytics. The most widespread use case among enterprises is Business Intelligence (BI) solutions. Understanding what ETL means is essential to knowing how you can make use of it in your project. In this article, we explain what ETL is and why it's so important.

What is the Extract, Transform, and Load process?

To put it simply, ETL describes the method for collecting data from various sources and delivering it for further use in a standardized form - all thanks to storing it in databases that form data warehouses. It’s a general description that tells us how the process should be accomplished. Extracting data from different files, transforming it, and then saving it to different files is a kind of ETL process too - even though we don’t end up creating a data warehouse but rather a loose set of files containing information we can use further. There are many patterns for building data warehouses apart from ETL. Depending on individual requirements, we can bet on ETL (creating an ETL-based warehouse), but also on alternatives such as the Enterprise Service Bus (ESB), Enterprise Application Integration, the newer version of ETL called ELT, as well as Data Virtualization, Data Federation, and iPaaS.
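As a toy illustration of the three steps (and of the point that even file-to-file processing counts as ETL), here is a sketch using only the standard library; the file names, columns, and transformation rule are made up:

[cc lang="python" escaped="true" lines="100"]
import csv
import sqlite3

def extract(path="sales.csv"):
    # Extract: read raw rows from a source file.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    # Transform: unify the data - normalize names, convert prices to cents.
    for row in rows:
        yield (row["product"].strip().lower(), int(float(row["price"]) * 100))

def load(records, db_path="warehouse.db"):
    # Load: store the transformed records in the target database.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (product TEXT, price_cents INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()
    conn.close()

load(transform(extract()))
[/cc]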

 

Implementing ETL

ETL systems usually integrate data from multiple systems that are most of the time developed and supported by different vendors. What's more, these separate systems, which contain the original data, are often managed and operated by different employees. Take a cost accounting system as an example – it usually combines data coming from sources such as payroll, purchasing, and sales. But note that while implementing such a system looks promising at the beginning, as more data pours in we need more space and the processing stage takes a lot of time. Each change becomes expensive and it takes a long time to get results. Developing and maintaining such a system might become challenging because a single mistake can bring catastrophic consequences. Implementing a system based on ETL in small- and mid-sized companies usually doesn’t make sense - it’s just not very cost-effective. To reduce expenses in this area, companies can try implementing the ELT paradigm - but accomplishing that is quite challenging even for the most experienced developers.

 

The value of ETL

Getting data from various sources, transforming it, applying business rules, loading it to the right destinations, and validating the results - each of these steps is a cog in the complex machinery that keeps the correct data flowing within an organization. ETL processes are often complex, and their design is critical to an organization's operational efficiency. After all, they're at the center of every organization’s data management strategy. Note that the three phases of ETL are often executed in parallel. That's because data extraction takes a lot of time and we can't afford to waste any of it. While data is being extracted, we can set up a transformation process that works on the data already received and prepares it for loading. The loading process can then start – without waiting for the previous phases to be completed.
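Here is a simplified sketch of what running the phases in parallel can look like, with worker threads passing records through queues so transformation and loading start before extraction finishes (the worker bodies are placeholders):

[cc lang="python" escaped="true" lines="100"]
import queue
import threading

raw, clean = queue.Queue(), queue.Queue()
DONE = object()  # sentinel marking the end of a stream

def extract():
    for record in range(5):       # stand-in for reading from a slow source
        raw.put(record)
    raw.put(DONE)

def transform():
    while (item := raw.get()) is not DONE:
        clean.put(item * 10)      # stand-in for the real transformation
    clean.put(DONE)

def load():
    while (item := clean.get()) is not DONE:
        print("loaded", item)     # stand-in for writing to the warehouse

workers = [threading.Thread(target=f) for f in (extract, transform, load)]
for w in workers:
    w.start()
for w in workers:
    w.join()
[/cc]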

 

Why is ETL important?

Information was critical to business success long before computers were invented. For instance, artisans who passed their knowledge from generation to generation made their products more refined - and merchants who had the information about where to buy and sell them made fortunes. As soon as we learned how to write, we started storing information - and technological development caused the amount of data to grow exponentially, forcing us to come up with solutions for processing and analyzing it. It’s not enough to store data today. Businesses need to know how to extract and process it - and that’s where ETL helps. The most common enterprise use case of ETL is systems that store and process data critical to the decision-making process - one of the core elements of Business Intelligence. ETL systems allow organizations to collect a massive amount of data and then process and store it in a form that allows easy analysis or presentation.

 

Example use case 1

Let’s imagine that we’re a global organization producing clothing. To come up with our business objectives for the following year, we collect data such as price, color, and size about relevant products which are sold by 10 of the most popular online stores. By passing this data through our system, we can see how pricing changed over the years, which color was most popular during a given season, and which sizes were most widespread on the market. We can compare this information with our products and see how that affected our revenue. That type of data is essential for planning - and delivering it is the main job of ETL systems.

 

Example use case 2

Let’s imagine that we’re an investor interested in creating an online platform that acts as an intermediary for purchasing flights. We have met with a company that builds such services, and now we know that the approximate cost would be $120 thousand, plus $30 thousand per month for maintenance. We have also met with several large airlines and know that we could earn $20 for each ticket sold. We acquired historical data about sold tickets and - since we own an intermediary for hotel booking - we know how many people use such platforms. With that knowledge, it seems that we’re prepared to make a decision. But the historical data the airlines shared with us is saved in completely different formats - while some sent us simple Excel files, others shared an API that allows downloading the information. Our hotel booking system saves statistical data in a way that makes understanding it really difficult. Moreover, we need a list of all airlines together with their phone numbers. To pull all this information together in the right format, we can take advantage of an ETL system. A well-designed ETL process enables organizations to extract data from source systems, improves data quality and consistency standards, and delivers data in a format that is easy to understand for the developers who build applications and the other stakeholders who need data to support their decision-making. Have you got any questions about the ETL process and how to design it to help your organization make better use of its data? Reach out to us in the comments; we're always happy to offer advice on proven data engineering practices that comply with industry standards and help organizations take full advantage of their data.
