fonttools/Lib/fontTools/unicodedata/__init__.py

from __future__ import (
    print_function, division, absolute_import, unicode_literals)
from fontTools.misc.py23 import *

from bisect import bisect_right

try:
    # use unicodedata backport compatible with python2:
    # https://github.com/mikekap/unicodedata2
    from unicodedata2 import *
except ImportError:
    # fall back to built-in unicodedata (possibly outdated)
    from unicodedata import *

from .scripts import SCRIPT_RANGES, SCRIPT_NAMES


__all__ = [
    # names from built-in unicodedata module
    "lookup",
    "name",
    "decimal",
    "digit",
    "numeric",
    "category",
    "bidirectional",
    "combining",
    "east_asian_width",
    "mirrored",
    "decomposition",
    "normalize",
    "unidata_version",
    "ucd_3_2_0",
    # additional functions
    "script",
]


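def script(char):
    """Return the script name of the Unicode character `char`, looked up
    in the SCRIPT_RANGES and SCRIPT_NAMES tables.
    """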
    code = byteord(char)
    # 'bisect_right(a, x, lo=0, hi=len(a))' returns an insertion point which
    # comes after (to the right of) any existing entries of x in a, and it
    # partitions array a into two halves so that, for the left side
    # all(val <= x for val in a[lo:i]), and for the right side
    # all(val > x for val in a[i:hi]).
    # Our 'SCRIPT_RANGES' is a sorted list of ranges (only their starting
    # breakpoints); we want to use `bisect_right` to look up the range that
    # contains the given codepoint: i.e. whose start is less than or equal
    # to the codepoint. Thus, we subtract 1 from the index returned.
    i = bisect_right(SCRIPT_RANGES, code)
    return SCRIPT_NAMES[i-1]
[unicodedata] add new module and 'script' function
2017-11-17 19:17:17 +00:00

The new `fontTools.unicodedata` module re-exports all the public functions from the built-in `unicodedata` module, and also adds additional functions. The `script` function takes a unicode character and returns the script name as defined in the UCD "Scripts.txt" data file. It's implemented as a simple binary search, plus a memoizing decorator that caches the results to avoid searching for the same character more than once. The unicodedata2 backport is imported if present, otherwise the unicodedata built-in is used.

[unicodedata] use bisect.bisect_right function
2017-11-20 13:30:17 +01:00

CPython comes with a fast C implementation of the bisect module. This gives a 4 to 5 times speed-up over my pure-python version.
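
The range lookup used by `script` can be tried in isolation with a small sketch. The names `TOY_RANGES`, `TOY_NAMES` and `toy_script` are hypothetical and exist only for this illustration. The tables are a hand-written simplification that is only meaningful for ASCII input, not the real data from `fontTools.unicodedata.scripts`, and plain `ord` stands in for `byteord`.

```python
from bisect import bisect_right

# Hypothetical toy tables (assumption: simplified, ASCII-only).
# TOY_RANGES holds the first code point of each range, in sorted order;
# TOY_NAMES holds the script name for the range at the same index.
TOY_RANGES = [0x0000, 0x0041, 0x005B, 0x0061, 0x007B]
TOY_NAMES = ["Common", "Latin", "Common", "Latin", "Common"]


def toy_script(char):
    """Return the toy script name for a single character."""
    code = ord(char)
    # bisect_right returns the index just past the last range start that
    # is <= code, so the containing range sits one position to the left.
    i = bisect_right(TOY_RANGES, code)
    return TOY_NAMES[i - 1]


for ch in ("A", "z", "1", "["):
    print(repr(ch), toy_script(ch))
```

Because the first breakpoint is 0, `bisect_right` returns at least 1 for any valid code point, so `i - 1` never wraps around to the last entry.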