I am extracting streaming subscriber data from text using an LLM, and I get results like this:
    {
      "raw_extractions": [
        {
          "platform_mention": "Netflix",
          "year_mention": "2012",
          "subscriber_mention": "roughly 30 million subscribers worldwide"
        },
        {
          "platform_mention": "Netflix",
          "year_mention": "2020",
          "subscriber_mention": "just under 200 million"
        },
        {
          "platform_mention": "Netflix",
          "year_mention": "2022",
          "subscriber_mention": "hovered around 220 million subscribers"
        }
      ]
    }

I need to convert this into clean time-series data for analysis (subscriber counts in millions):
| year | platform | subscribers_min | subscribers_max | confidence |
|------|----------|-----------------|-----------------|------------|
| 2012 | Netflix  | 30              | 30              | medium     |
| 2020 | Netflix  | 195             | 200             | medium     |
| 2022 | Netflix  | 220             | 220             | medium     |

What is the best Python approach to parse fuzzy phrases like "roughly 30 million" and "just under 200 million" into numeric ranges?
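For reference, one possible starting point is a small rule-based parser: a regex pulls out the number and its scale word, and a lookup table maps hedging qualifiers to a (min, max) multiplier. This is only a sketch under assumptions. The names `QUALIFIERS`, `parse_mention`, and `to_rows` are hypothetical, the 2.5% margin for "just under" is chosen only because it reproduces the 195 to 200 row in the table above, and treating any hedged phrase as `medium` confidence (and an unhedged one as `high`) is a guess at the intended convention.

```python
import re

# Hypothetical rule table: qualifier pattern -> (min multiplier, max multiplier).
# The 2.5% margin is an assumption picked to match the 195-200 row above.
QUALIFIERS = [
    (re.compile(r"\b(just under|almost|nearly)\b", re.I), (0.975, 1.0)),
    (re.compile(r"\b(just over|more than|over)\b", re.I), (1.0, 1.025)),
    (re.compile(r"\b(roughly|around|about|approximately)\b", re.I), (1.0, 1.0)),
]

# Scale words converted to millions, the unit used in the target table.
SCALES = {"thousand": 1e-3, "million": 1.0, "billion": 1e3}

NUMBER = re.compile(r"(\d+(?:\.\d+)?)\s*(thousand|million|billion)", re.I)


def parse_mention(text):
    """Return min/max (in millions) and a confidence label, or None if no number is found."""
    match = NUMBER.search(text)
    if match is None:
        return None
    value = float(match.group(1)) * SCALES[match.group(2).lower()]
    for pattern, (low, high) in QUALIFIERS:
        if pattern.search(text):
            return {
                "subscribers_min": round(value * low, 1),
                "subscribers_max": round(value * high, 1),
                "confidence": "medium",  # assumption: hedged wording means medium
            }
    # No qualifier found: treat the figure as exact (assumed "high" confidence).
    return {"subscribers_min": value, "subscribers_max": value, "confidence": "high"}


def to_rows(payload):
    """Flatten the LLM output into time-series rows sorted by year."""
    rows = []
    for item in payload["raw_extractions"]:
        parsed = parse_mention(item["subscriber_mention"])
        if parsed is None:
            continue  # skip mentions with no recognizable number
        rows.append({
            "year": int(item["year_mention"]),
            "platform": item["platform_mention"],
            **parsed,
        })
    return sorted(rows, key=lambda row: row["year"])
```

The list returned by `to_rows` can be passed straight to `pandas.DataFrame(rows)` if a DataFrame is needed. Keeping the qualifiers in a table rather than in branching code makes it easy to add phrases or adjust margins as new wording shows up in the extractions.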
