twitter-text-python
This is a Python port of the twitter/twitter-text libraries, fully compliant with the official conformance test suite.
Features
This library calculates the length of a tweet according to the Twitter Developers documentation, so you can validate a tweet without calling the Web API. Counting characters may seem easy, but it is actually quite complicated, especially when the text contains CJK characters, URLs, or emojis.
The original twitter-text libraries also provide hit-highlighting and auto-linking features, but this Python port does not support them yet.
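To illustrate why a plain character count is not enough, here is a minimal sketch that measures the example text used on this page in three naive ways, using only the Python standard library. The variable names are illustrative and not part of this library. None of the three results equals the weighted length of 46 that this page reports for the same text.

```python
# Naive length measures for the example text used on this page.
text = 'english text 日本語 😷 https://example.com'

# Number of Unicode code points (what len() returns for a Python str).
code_points = len(text)

# Number of UTF-16 code units: every code unit takes 2 bytes in UTF-16-LE.
utf16_units = len(text.encode('utf-16-le')) // 2

# Number of bytes in the UTF-8 encoding.
utf8_bytes = len(text.encode('utf-8'))

print(code_points, utf16_units, utf8_bytes)
```

The emoji counts as one code point but two UTF-16 code units, and the URL is counted character by character, which is why none of these measures agrees with the weighted length.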
Usage
Installation
$ pip install twitter-text-parser
Examples
See the API reference for more details.
from twitter_text import parse_tweet, extract_emojis_with_indices, extract_urls_with_indices

text = 'english text 日本語 😷 https://example.com'

assert parse_tweet(text).asdict() == {
    'weightedLength': 46,
    'valid': True,
    'permillage': 164,
    'validRangeStart': 0,
    'validRangeEnd': 38,
    'displayRangeStart': 0,
    'displayRangeEnd': 38
}

assert extract_urls_with_indices(text) == [{
    'url': 'https://example.com',
    'indices': [19, 38]
}]

assert extract_emojis_with_indices(text) == [{
    'emoji': '😷',
    'indices': [17, 18]
}]
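Because the returned indices are Unicode code point offsets, they should be usable directly as Python slice bounds. The sketch below does not call the library; it only reuses the index pairs from the example above, treating each pair as a half-open range, which is what the example values suggest.

```python
text = 'english text 日本語 😷 https://example.com'

# Index pairs copied from the example above, read as [start, end) in code points.
url_start, url_end = 19, 38
emoji_start, emoji_end = 17, 18

# Python str slicing also works on code points, so no conversion is needed.
url = text[url_start:url_end]
emoji = text[emoji_start:emoji_end]

print(url)    # https://example.com
print(emoji)
```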
API References
- class twitter_text.ParsedResult(valid: bool, weightedLength: int, permillage: int, validRangeStart: int, validRangeEnd: int, displayRangeStart: int, displayRangeEnd: int)
- twitter_text.parse_tweet(text: str, options: dict = {'default_weight': 200, 'emoji_parsing_enabled': True, 'max_weighted_tweet_length': 280, 'ranges': [{'start': 0, 'end': 4351, 'weight': 100}, {'start': 8192, 'end': 8205, 'weight': 100}, {'start': 8208, 'end': 8223, 'weight': 100}, {'start': 8242, 'end': 8247, 'weight': 100}], 'scale': 100, 'transformed_url_length': 23, 'version': 3}) → twitter_text.parse_tweet.ParsedResult
Parse a Twitter text according to https://developer.twitter.com/en/docs/developer-utilities/twitter-text
Parameters:
- text (str) – A text to parse.
- options (dict) – Parameters for counting the weighted tweet length. This must have the following properties:
  - max_weighted_tweet_length (int): Valid tweet messages must not exceed this weighted length.
  - default_weight (int): Default weight to cover code points not defined in the ranges.
  - ranges (list of dict): A list of Unicode code point ranges, with a weight associated with each of these ranges. Each element of ranges must have the following attributes:
    - start (int)
    - end (int)
    - weight (int)
  - scale (int): The weights are divided by scale.
  - emoji_parsing_enabled (bool): When set to True, it counts an emoji consisting of multiple Unicode code points as a single character, resulting in a visually intuitive weighted length.
  - transformed_url_length (int): The default length assigned to all URLs.
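The way ranges, default_weight and scale interact can be sketched as follows. This is a simplified illustration, not the library's implementation: it assumes that start and end are inclusive bounds, and it ignores URLs, multi-code-point emojis and invalid characters entirely. The names DEFAULT_OPTIONS, code_point_weight and naive_weighted_length are hypothetical; the option values are copied from the default shown in the signature above.

```python
# Option values copied from the default of parse_tweet's options parameter.
DEFAULT_OPTIONS = {
    'default_weight': 200,
    'max_weighted_tweet_length': 280,
    'ranges': [
        {'start': 0, 'end': 4351, 'weight': 100},
        {'start': 8192, 'end': 8205, 'weight': 100},
        {'start': 8208, 'end': 8223, 'weight': 100},
        {'start': 8242, 'end': 8247, 'weight': 100},
    ],
    'scale': 100,
}


def code_point_weight(code_point: int, options: dict) -> int:
    """Return the unscaled weight of one code point (hypothetical helper)."""
    for r in options['ranges']:
        # Assumption: both range bounds are inclusive.
        if r['start'] <= code_point <= r['end']:
            return r['weight']
    return options['default_weight']


def naive_weighted_length(text: str, options: dict = DEFAULT_OPTIONS) -> int:
    """Sum per-code-point weights and divide by scale.

    Simplification: URLs and emoji sequences get no special treatment here.
    """
    total = sum(code_point_weight(ord(ch), options) for ch in text)
    return total // options['scale']


print(naive_weighted_length('english text'))  # 12: Latin letters weigh 100 each
print(naive_weighted_length('日本語'))         # 6: CJK falls back to default_weight 200
```

With the default options, a tweet may therefore hold up to 280 Latin characters but only 140 CJK characters.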
Return ParsedResult: An object having the following properties:
- weightedLength (int)
  The weighted length of the twitter text. Each Unicode character (or URL, emoji) in text is assigned an integer weight, and these weights are summed to calculate weightedLength.
- valid (bool)
  True if the text is valid, i.e.,
  - weightedLength <= max_weighted_tweet_length
  - text does not contain invalid characters.
- permillage (int)
  Equal to weightedLength * 1000 // max_weighted_tweet_length.
- displayRangeStart (int)
  Always 0.
- displayRangeEnd (int)
  The number of UTF-16 code units in text, minus one.
- validRangeStart (int)
  Always 0.
- validRangeEnd (int)
  The number of UTF-16 code units in the valid part of text, minus one. The “valid part” here means the longest valid Unicode substring starting from the beginning of text.
Example:
>>> parse_tweet('english text 日本語 😷 https://example.com')
ParsedResult(
    weightedLength=46,
    valid=True,
    permillage=164,
    validRangeStart=0,
    validRangeEnd=38,
    displayRangeStart=0,
    displayRangeEnd=38
)
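As a rough cross-check of the example output, the sketch below recomputes permillage and displayRangeEnd by hand with the standard library only. It takes the weighted length of 46 from the example as given rather than computing it, and the variable names are illustrative, not part of the library.

```python
text = 'english text 日本語 😷 https://example.com'

weighted_length = 46             # taken from the example output above
max_weighted_tweet_length = 280  # default option value

# Multiply before the floor division; dividing first would truncate to 0.
permillage = weighted_length * 1000 // max_weighted_tweet_length

# UTF-16 code units: 2 bytes each in UTF-16-LE; the emoji takes two units.
utf16_units = len(text.encode('utf-16-le')) // 2
display_range_end = utf16_units - 1

print(permillage)         # 164
print(display_range_end)  # 38
```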
- twitter_text.extract_urls(text: str, extract_urls_without_protocol: bool = True) → List[str]
Extract valid URLs present in text.
>>> extract_urls('http://twitter.com/これは日本語です。example.com中国語')
["http://twitter.com/", "example.com"]
- twitter_text.extract_urls_with_indices(text: str, extract_urls_without_protocol: bool = True) → List[dict]
Extract valid URLs present in text along with their Unicode code point indices.
>>> extract_urls_with_indices('http://twitter.com/これは日本語です。example.com中国語')
[
    {
        "url": "http://twitter.com/",
        "indices": [0, 19]
    },
    {
        "url": "example.com",
        "indices": [28, 39]
    }
]
- twitter_text.extract_emojis_with_indices(text: str) → List[dict]
Extract emojis present in text along with their Unicode code point indices.
>>> extract_emojis_with_indices('text 😷')
[{'emoji': '😷', 'indices': [5, 6]}]
>>> extract_emojis_with_indices('🙋🏽👨🎤') [{'emoji': '🙋🏽', 'indices': [0, 2]}, {'emoji': '👨🎤', 'indices': [2, 5]}]
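The extraction functions report code point indices, while the range fields of ParsedResult count UTF-16 code units. If the two need to be compared, a conversion along the following lines might help. This is only a sketch: to_utf16_index is a hypothetical helper, not part of this library, and it assumes the text contains no lone surrogates.

```python
def to_utf16_index(text: str, code_point_index: int) -> int:
    """Convert a code point offset into a UTF-16 code unit offset.

    Hypothetical helper: encodes the prefix and counts 2-byte code units.
    """
    prefix = text[:code_point_index]
    return len(prefix.encode('utf-16-le')) // 2


text = 'english text 日本語 😷 https://example.com'

# Code point indices as reported in the Examples section of this page.
url_indices = [19, 38]
emoji_indices = [17, 18]

# The emoji occupies two UTF-16 code units, so later offsets shift by one.
print([to_utf16_index(text, i) for i in url_indices])    # [20, 39]
print([to_utf16_index(text, i) for i in emoji_indices])  # [17, 19]
```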