Using Concept Frequency-Inverse Concept Document Frequency to Find Conceptual Distinctions in Claim Notes

When working on claims projects at The General®, it’s important to know how claims differ across various dimensions. Every claim is different, but certain details cause claims to be handled in a (roughly) uniform way by a specific team. Given the uncertainty inherent in claims, some get routed to the wrong team and must be re-routed to the correct destination. A major source of that ambiguity is that much of the information in a claim file is unstructured, unstandardized text written by multiple people, at different times, with different pieces of information available. But finding common details among seemingly different claims can reveal a common claim route.

By using natural language processing to find these details and patterns in claim notes, the Data Science team can make the pathway of a claim more predictable and efficient. One way we do this is through concept detection.

Say we want to find instances of head injuries—we don’t just want headaches, and we definitely don’t want head-on collisions. So how do we accomplish this? If you’ve been following autoregressed.com then you know that word embeddings have helped us accomplish many of these NLP tasks with much success. And word embeddings are the very foundation of CFIDF, a method I’ve written about before to help determine which concepts are unique to a certain group of data. At The General, we use CFIDF to see which injuries, damages, and accident details are unique to a specific slice of the data.
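To make the mechanics concrete, here’s a minimal sketch of the kind of pipeline that produces the multi-word concepts and similarity scores you’ll see below. It assumes gensim, and the load_tokenized_claim_notes() helper is hypothetical; this isn’t our exact production code.

# A minimal sketch, not our production pipeline. Assumes gensim;
# load_tokenized_claim_notes() is a hypothetical helper that returns claim notes
# as lists of tokens, e.g. [["insured", "hit", "head", "on", "dashboard"], ...]
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

notes = load_tokenized_claim_notes()

# Learn multi-word concepts such as "neck_pain" or "forehead_laceration"
phrases = Phraser(Phrases(notes, min_count=10, threshold=10.0))
phrased_notes = [phrases[note] for note in notes]

# Train word embeddings over the phrased notes
model = Word2Vec(phrased_notes, vector_size=100, window=5, min_count=5, workers=4)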

One way the Data Science team can use CFIDF to help triage claims is by determining which injuries and damages are specific to a certain severity class. Our claims department segments claims in many ways; one is by severity class. The severity classes are minor, moderate, major, life-threatening, and death, and the details of an auto accident vary significantly across these classes.
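For intuition, CFIDF can be thought of like TF-IDF, with each severity class playing the role of a “document” made up of detected concepts. The sketch below is one way that scoring might look; the exact formula is covered in the earlier CFIDF post, so treat this as an approximation rather than the definitive implementation.

import math
from collections import Counter

# Rough CFIDF-style scoring, treating each severity class as a "document".
# concepts_by_class is a hypothetical input: severity class -> list of detected concepts.
def cfidf_scores(concepts_by_class):
    n_classes = len(concepts_by_class)
    counts = {cls: Counter(concepts) for cls, concepts in concepts_by_class.items()}

    # In how many severity classes does each concept appear at least once?
    class_freq = Counter()
    for counter in counts.values():
        class_freq.update(counter.keys())

    scores = {}
    for cls, counter in counts.items():
        total = sum(counter.values())
        scores[cls] = {
            concept: (freq / total) * math.log(n_classes / class_freq[concept])
            for concept, freq in counter.items()
        }
    return scores  # higher score = concept is more unique to that severity class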

Let’s take a look at three injury types by severity class using CFIDF:

[Figure: head, neck, and leg injury concepts by severity class (concept-detection-injuries.png)]

Here we can see a clear pattern: head and leg injuries are more unique to severe claims than neck injuries are. When we look at which concepts the word embedding algorithm has learned are similar to head, we see the following:

[('forehead', 0.6987695693969727),
('face', 0.5426414012908936),
('dash', 0.5411489605903625),
('nose', 0.5350248217582703),
('airbag', 0.5096896290779114),
('chin', 0.5027918815612793),
('dashboard', 0.48118510842323303),
('lip', 0.4772607684135437),
('brain', 0.4757746756076813),
('vehl_airbag', 0.47072070837020874),
('forehead_laceration', 0.4656938314437866),
('mouth', 0.45293664932250977)]
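Lists like the one above come from querying the embedding model for a concept’s nearest neighbors. Assuming a gensim Word2Vec model like the one sketched earlier, the query is a one-liner:

# Nearest neighbors of "head" in the embedding space (gensim's most_similar)
similar_to_head = model.wv.most_similar('head', topn=12)
for concept, similarity in similar_to_head:
    print(concept, round(similarity, 3))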

The concepts similar to neck are:

[('neck_shoulders', 0.8123904466629028),
('neck_pain', 0.7298449277877808),
('neck_mid', 0.7143591642379761),
('neck_upperlower', 0.6772769689559937),
('injuriesneck', 0.6220399737358093),
('neck_midlow', 0.5724725127220154),
('stiff_neck', 0.5578615665435791)]

And the concepts similar to leg are:

[('knee', 0.9152631759643555),
('hip', 0.8944814801216125),
('ankle', 0.8675630688667297),
('thumb', 0.8001316785812378),
('foot', 0.7815761566162109),
('thigh', 0.7603033185005188),
('femur', 0.6795393228530884),
('shin', 0.6363328695297241),
('calf', 0.6287088394165039),
('leghip', 0.6087517738342285),
('anklefoot', 0.6073967218399048),
('lower_leg', 0.604314923286438)]

We begin to see why these patterns emerge. Usually a face or brain injury implies a more severe claim than a stiff_neck, so it’s no surprise that head injuries are more unique to more severe claims.

Now let’s take a look at damages. The following graph implies that windshield damages (and conceptually similar damages) are more unique to higher-severity claims, while bumper damages are not. Let’s see if we can understand why.

[Figure: windshield and bumper damage concepts by severity class (concept-detection-damages.png)]

Here are the concepts most associated with windshield:

[('grill', 0.6117886304855347),
('wshield', 0.60284423828125),
('window', 0.5673037767410278),
('tooth', 0.5466489195823669),
('windows', 0.5371078252792358),
('dashboard', 0.5360897183418274),
('console', 0.5302398204803467),
('brick_wall', 0.5039163827896118),
('dash', 0.4948217272758484),
('radiator', 0.493050754070282),
('steering_wheel', 0.4732726812362671),
('guard_rail', 0.4728657603263855),
('hood', 0.47128745913505554),
('lobe_contusion', 0.45635491609573364),
('wheel', 0.4556693732738495),
('dash_board', 0.45093899965286255),
('roof', 0.44819214940071106),
('engine', 0.44798600673675537),
('grille', 0.4474671483039856),
('head_lac', 0.44536226987838745)]

And here are the concepts most associated with bumper:

[('rear_bumper', 0.6575539708137512),
('bumper_dented', 0.5856902003288269),
('front_bumper', 0.5853343605995178),
('tail_light', 0.5721505880355835),
('bumper_scratched', 0.5611271262168884),
('hatch', 0.5416772961616516),
('headlight', 0.527552604675293),
('bumper_cover', 0.5189690589904785),
('bumer', 0.5177005529403687),
('bumper_detached', 0.5135642886161804),
('bumper_hanging', 0.5126307010650635),
('small_dent', 0.5055361986160278),
('gas_tank', 0.5001139640808105),
('spasms', 0.4687708020210266),
('dual_exhaust', 0.4562070965766907),
('bumper_tailgate', 0.4542809724807739),
('dent', 0.45353513956069946),
('taillight', 0.4533007740974426),
('bumpertrunk', 0.4512375593185425),
('trunk_lid', 0.4502853751182556),
('painmuscle_spasms', 0.44703739881515503),
('side_mirror', 0.44632619619369507)]

With windshield damages, we get parts of the car that are likely more expensive, like radiator, hood, and engine, while for bumper damages we get terms like small_dent, dent, and headlight, which imply less severe details.

The approach outlined here certainly isn’t perfect, and we’re constantly discovering new and better ways to analyze claim notes. Still, this method helps us identify which conceptual characteristics of a claim are unique to a certain dimension of the claim data.

Stay tuned for more on how we’re trying to make sense of text data to make customers’ lives easier.

Tim Dobbins