Lisa: Conversational AI

Inspired by PETER STEINER/The New Yorker magazine (1993) “On the internet, nobody knows you’re a dog.” Generated with DALL-E 2 prompt “Dog sitting on a chair in front of a computer with one paw on the keyboard. Comic style, black and white.”

In our last post we discussed the first step in the leasing process driven by Lisa, the Inquiry Parser. Once a message from an Internet Listing Service has been parsed, Lisa’s conversational AI is ready to chat with the prospective resident, either via text or email (we politely decline phone calls 🙂).

Lisa’s “Galaxy Brain”

The primary driver behind Lisa’s conversational AI is what we refer to colloquially as Galaxy Brain, or Galaxy (an evolution of Brain, with constant intelligence gathering). Galaxy’s task is framed as multi-label text classification: it converts the conversation into a structured response, which Lisa’s logic layer then uses to drive the conversation forward.

The structured response, pictured below, is a set of labels that are accompanied by confidence scores. The labels included in our model are:

  • Intents - Tasks prospective residents want to accomplish

    • (e.g. “This Thursday works for me” to accept, “Can we do Friday instead?” to counter-offer or reschedule, “I can no longer make the showing” to cancel)

  • Categorical slot values - A piece of information that can be categorized

    •  (e.g. “I’m looking for a 1 bedroom”, “Can I do a virtual showing?”)

  • Requested slots - A piece of information the prospective resident requests from us

    • (e.g. “What's the rent?”, “Do you accept Section 8?”)

  • Acknowledgements - Cordial responses

    • (e.g. “You’re welcome!”, “Thank you for your time.”)

  • Miscellaneous labels - Actions that change the back-end behavior of Lisa

    • (e.g. Mark the thread as spam, have the thread skip the operators’ inbox)

Galaxy Input text (top blue blob) and output response (bottom blue blob). The input text includes current inbound message (red) and conversation history (blue), and the output response includes confidence scores for each label.

The confidence score is a value between 0 and 1 that represents the likelihood that the output of the model is correct, with 1 being the highest. In this instance, because the confidence in SET_TOUR_TYPE_VIRTUAL is high, we would first mark the prospective resident’s preference for virtual tours, and then offer to schedule them a virtual tour. If this score were low, the message might instead be handed off to an operator for review.
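As a minimal sketch, the resulting routing decision might look like the following; the threshold value and names are hypothetical, not Lisa’s actual implementation.

```python
# Minimal sketch of confidence-gated routing; the threshold and names
# are hypothetical, not Lisa's actual implementation.
AUTO_THRESHOLD = 0.9  # assumed cutoff for acting without operator review

def route(label: str, score: float) -> str:
    """Act autonomously on confident predictions, otherwise hand off."""
    if score >= AUTO_THRESHOLD:
        return f"auto: act on {label}"  # e.g. mark the virtual tour preference
    return f"handoff: operator reviews {label}"

print(route("SET_TOUR_TYPE_VIRTUAL", 0.97))  # auto: act on SET_TOUR_TYPE_VIRTUAL
print(route("SET_TOUR_TYPE_VIRTUAL", 0.41))  # handoff: operator reviews ...
```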

While highly accurate, deep learning models commonly suffer from overconfidence in their predictions. This means that they output a very high (or low) confidence score even when there is high uncertainty about whether the prediction is correct. To adjust for this, our model is fitted with a set of calibration models, one per label, which map the confidence scores so that they correspond more closely to the probability that the prediction is correct.
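A per-label calibration step could be sketched as follows using Platt scaling from scikit-learn; the choice of method is our assumption, and isotonic regression is a common alternative for this kind of post-hoc calibration.

```python
# Sketch of per-label calibration using Platt scaling (scikit-learn).
# The method is assumed; isotonic regression is a common alternative.
import numpy as np
from sklearn.linear_model import LogisticRegression

class PerLabelCalibrator:
    """One calibration model per label, fit on held-out (score, correct) pairs."""

    def __init__(self, labels):
        self.models = {label: LogisticRegression() for label in labels}

    def fit(self, raw_scores, is_correct):
        # raw_scores[label]: uncalibrated confidences on a validation set
        # is_correct[label]: 0/1 flags for whether each prediction was right
        for label, model in self.models.items():
            model.fit(raw_scores[label].reshape(-1, 1), is_correct[label])
        return self

    def calibrate(self, label: str, score: float) -> float:
        # Map a raw confidence to a probability that tracks empirical accuracy.
        return self.models[label].predict_proba(np.array([[score]]))[0, 1]
```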

For non-categorical slot values, such as names, we use a separate Seq2Seq model similar to Lisa's Inquiry Parser.

NLP models transform the conversation history into a structured conversation state. A logic layer combines the conversation state with information from a Knowledge Base (KB) to compute the next action or response.

How Lisa Responds 

Lisa uses a state-of-the-art Transformer-based classifier to map natural language into a structured Conversation State. There is a limit on input text length stemming from the quadratic complexity of the attention mechanism in Transformers, as each token (subword unit) is queried against all other tokens in the input text. A common limit for Transformer-based models is 512 tokens; to accommodate this, we simply truncate the beginning of the conversation history, as this portion is typically the least relevant to the current turn. Recently, linear attention mechanisms have been developed that greatly increase this length limit, but we haven’t found any significant performance gains from them.
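With the Hugging Face tokenizers API, this left-side truncation can be expressed roughly as below; the checkpoint name is a placeholder, not Lisa’s actual model.

```python
# Sketch: keep the most recent turns by truncating from the left.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder
tokenizer.truncation_side = "left"  # drop the oldest tokens first

turns = [
    "Lisa: Would Thursday at 3pm work for a showing?",
    "Prospect: Can we do Friday instead?",
]  # oldest first
encoded = tokenizer(
    " ".join(turns),
    truncation=True,
    max_length=512,  # the usual Transformer context limit
    return_tensors="pt",
)
```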

We also include special tokens indicating the speaker, as well as the local timestamp of when each message was sent. This helps Galaxy infer information from the pacing of messages, as well as the time of day, day of the week, and current month. It can also help resolve ambiguities and prioritize the parts of the input most relevant to the current turn.
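A sketch of such a serialization might look like the following; the bracketed token formats are invented for illustration, as the post doesn’t specify the exact special tokens.

```python
# Sketch of serializing turns with speaker and local-timestamp markers.
# The bracketed formats are illustrative, not Lisa's actual tokens.
from datetime import datetime

def serialize_turn(speaker: str, sent_at: datetime, text: str) -> str:
    # e.g. "[PROSPECT] [Tue 2022-05-03 14:05] Can we do Friday instead?"
    return f"[{speaker.upper()}] [{sent_at:%a %Y-%m-%d %H:%M}] {text}"

history = [
    ("lisa", datetime(2022, 5, 3, 9, 12), "Would Thursday at 3pm work?"),
    ("prospect", datetime(2022, 5, 3, 14, 5), "Can we do Friday instead?"),
]
model_input = " ".join(serialize_turn(*turn) for turn in history)
```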

We generate confidence scores for each label independently of the others, allowing an inbound message to have multiple classifications (i.e. a “multi-label” model). This simple setup also allows us to add new labels without touching the model code: we simply modify the data generation, and the new label shows up after the next retraining.
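In PyTorch terms, the key difference from a single-label classifier is a sigmoid per output unit instead of a softmax across them, roughly as in this sketch (sizes are illustrative).

```python
# Sketch of a multi-label classification head: an independent sigmoid per
# label, so one message can score high on several classes at once.
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    def __init__(self, hidden_size: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # Sigmoid (not softmax): labels are scored independently, so
        # SET_UNIT_TYPE_BR2 and SECTION8 can both be near 1 at the same time.
        return torch.sigmoid(self.classifier(pooled))

head = MultiLabelHead(hidden_size=768, num_labels=300)  # "several hundred" labels
scores = head(torch.randn(1, 768))  # one score in [0, 1] per label
```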

For example, if a prospect asks “I would like a 2 bedroom and do you accept section 8?”, our model will return a score close to 1 for at least two classes – one for the question about “Section 8” (affordable housing) and another for setting the “2 bedroom” unit type.

Lisa then interprets this state by combining it with external knowledge to generate the natural language response back to the prospect. We refer to the external knowledge as Lisa’s Knowledge Base (KB), and it includes database lookups (e.g. to determine a property’s Section 8 policy) and API calls to external systems (e.g. to Google Calendar for an agent’s availability).
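Conceptually, the logic layer’s job can be sketched like this; the class and method names, label keys, and thresholds are hypothetical stand-ins.

```python
# Conceptual sketch of the logic layer; names and thresholds are hypothetical.
class KnowledgeBase:
    """Stub: real lookups hit the database and external APIs (e.g. calendars)."""

    def section8_policy(self, property_id: str) -> str:
        return "accepted"  # would be a database lookup in production

    def set_unit_type(self, prospect_id: str, unit_type: str) -> None:
        pass  # would persist the preference

def next_actions(scores: dict, context: dict, kb: KnowledgeBase) -> list:
    actions = []
    if scores.get("SET_UNIT_TYPE_BR2", 0.0) > 0.5:
        kb.set_unit_type(context["prospect_id"], "2BR")
        actions.append("recorded unit type preference: 2 bedroom")
    if scores.get("SECTION8", 0.0) > 0.5:
        policy = kb.section8_policy(context["property_id"])
        actions.append(f"answer Section 8 question: {policy}")
    return actions
```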

Here is an example of Galaxy in action. Given a message and its conversation history, Galaxy determines a score for each class. Most classes are irrelevant to this message and thus have very low scores. However, Galaxy identified two classes of importance here:

  1. Updating the unit type to 2 bedrooms

  2. The question pertaining to Section 8

When deciding whether the prospect would like a 1 or 2 bedroom apartment, Galaxy paid strong attention to “2 bedroom” in the prospect’s message, but also gave weight to the “1BR” mention in the conversation history. Together, these signals give the unit-type update class a high score. When judging whether there is a Section 8 question, Galaxy gets a strong positive signal from “accept section 8”, but negative signals from the discussion of unit types, because prospects don’t tend to mention unit type and Section 8 at the same time. In the end, the classifier assigns a positive but relatively small score to the Section 8 class.

Output of the SHAP explainer package for the label SET_UNIT_TYPE_BR2. It shows the importance of each word in the input relating to generating the output score for this label. The colors indicate which parts of the input the model deemed to have positive (red) and negative (blue) contributions. It mostly focuses on the words “2 bedroom” in the last prospect message, but also considers the unit types that Lisa said were available.

Output of the SHAP explainer package for the label SECTION8. It shows the importance of each word in the input relating to generating the output score for this label. The colors indicate which parts of the input the model deemed to have positive (red) and negative (blue) contributions. In this case it mostly focuses on the words “accept section 8?”.
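Explanations like the two above can be produced with the shap package and a Hugging Face text-classification pipeline, along the lines of this sketch; the checkpoint is a placeholder for Lisa’s actual model.

```python
# Sketch of generating word-level explanations with the shap package.
import shap
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="bert-base-uncased",  # placeholder checkpoint
    top_k=None,                 # return scores for every label
)
explainer = shap.Explainer(classifier)
shap_values = explainer(
    ["I would like a 2 bedroom and do you accept section 8?"]
)
shap.plots.text(shap_values)  # red/blue word-level contributions per label
```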

With Lisa’s KB integration we can carry out subsequent actions with our logic layer, such as

  • Mark the desired unit type in the database

  • Answer the question about the Section 8 policy

  • Look up and offer showing slots for the desired unit type

  • Cross-sell to a different property if the unit type is not available

The logic layer employs a non-ML, template-based approach to generating responses, instead of letting an ML model decide which template to choose or even generate text end-to-end. We chose this methodology because it gives us more control – without having to re-train the model, we can change how Lisa replies to messages or change the conversation flow just by adjusting the logic. Without this, operators would have to continuously correct the model’s behavior until enough data was collected to retrain it, making iterations slow, error-prone, and taxing on operators.
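A stripped-down sketch of that template approach, with templates and slot names invented for illustration:

```python
# Sketch of template-based response generation; templates and slot names
# are illustrative, not Lisa's actual content.
TEMPLATES = {
    "SECTION8_YES": "Yes, {property_name} accepts Section 8 vouchers.",
    "OFFER_SHOWING": "We have a {unit_type} available. Would {time_slot} work for a tour?",
}

def render(template_key: str, **slots) -> str:
    # Changing a reply or the conversation flow means editing a template
    # or the surrounding logic, not retraining the model.
    return TEMPLATES[template_key].format(**slots)

print(render("OFFER_SHOWING", unit_type="2 bedroom", time_slot="Thursday at 3pm"))
```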

Teaching Lisa

To teach Lisa about the leasing process, we need to collect structured training data from the product – one of the greatest challenges underlying all ML products. We carefully designed Lisa’s logic layer to obtain high-quality data without adding much to operators’ workload. Training a classification model usually requires a labeled dataset, one that has annotated classes for each data point.

In our application, this would mean labeling each inbound message against several hundred possible classes. Instead of asking our operators to create annotations explicitly, we infer labels from their behavior. Our operators’ main job is to reply to prospective residents and to correct our model’s mistakes when needed.

We implemented a convenient user interface that can provide structured responses for operators to choose from, so our model can learn directly from what operators do on the job. 
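In spirit, turning operator behavior into training targets can be sketched as follows; the function and label names are hypothetical.

```python
# Sketch of inferring 0/1 training targets from operator behavior;
# the function and label names are hypothetical.
def labels_from_operator_actions(suggested: dict, final_actions: set) -> dict:
    """Whatever the operator actually sent (approved, corrected, or added)
    becomes the positive label set for this inbound message."""
    all_labels = set(suggested) | final_actions
    return {label: int(label in final_actions) for label in all_labels}

# The model suggested SECTION8 and SET_UNIT_TYPE_BR1; the operator kept the
# Section 8 answer but corrected the unit type to 2 bedrooms.
targets = labels_from_operator_actions(
    suggested={"SECTION8": 0.91, "SET_UNIT_TYPE_BR1": 0.62},
    final_actions={"SECTION8", "SET_UNIT_TYPE_BR2"},
)
# targets == {"SECTION8": 1, "SET_UNIT_TYPE_BR1": 0, "SET_UNIT_TYPE_BR2": 1}
```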

One could say that machines learn what we do, not what we say. The user interface needs to account for different categories of classifications, such as question versus intent, and provide operators with easy ways to generate responses by clicking different buttons or navigating the UI with keyboard shortcuts. 

This machine-human interface blurs the boundaries between machine and human responses. Sometimes the machine bypasses operators entirely, and other times operators ignore the suggestions. However, most of the time, the response lies somewhere in the middle; it could be that the machine gives a strong suggestion and operators simply approve it, or that operators slightly modify it to better suit the conversation flow. 


So are prospective tenants talking to a machine or a human? With Lisa, the line is certainly blurry ¯\_(ツ)_/¯

Authors and contributors: Christfried Focke, Shyr-Shea Chang, Tony Froccaro, Miguel Rivera