Building an Email Archiving System: The Challenges and of Course the Solution - Part 1

Feb 4, 2019

Published by

Jeff Goldstein



About a year ago I wrote a blog on how to retrieve copies of emails for archival and viewing, but I did not broach the actual storing of the email or related data. More recently I wrote a blog on storing all of the event data (i.e., when the email was sent, opens, clicks, bounces, unsubscribes, etc.) on an email for the purpose of auditing, but chose not to create any supporting code.

With the increase of email usage in regulatory environments, I’ve decided it’s time to start a new project that pulls all of this together, with code samples on how to store the email body and all of its associated data. Over the next year, I will continue to build on this project with the aim of creating a working storage and viewing application for archived emails and all of the log information produced by SparkPost. SparkPost does not have a system that archives the email body, but it does make building an archival platform fairly easy.

In this blog series, I will describe the process I went through in order to store the email body on S3 (Amazon’s Simple Storage Service) and all relevant log data in MySQL for easy cross-referencing. Ultimately, this is the starting point for building an application that will allow for easy searching of archived emails, then displaying those emails along with the event (log) data. The code for this project can be found in the following GitHub repository: https://github.com/jeff-goldstein/PHPArchivePlatform

This first entry of the blog series describes the challenge and lays out an architecture for the solution. The rest of the blogs will detail portions of the solution along with code samples.

The first step in my process was to figure out how I was going to obtain a copy of the email sent to the original recipient. In order to obtain a copy of the email body, you need to do one of the following:

1. Capture the email body before sending the email
2. Get the email server to store a copy
3. Have the email server create a copy for you to store

If the email server is adding items like link tracking or open tracking, you can’t use #1 because the captured body won’t reflect the open/click tracking changes.

That means that either the server has to store the email or it has to somehow offer a copy of that email to you for storage. Since SparkPost does not have a storage mechanism for email bodies, but does have a way to create a copy of the email, we will have SparkPost send us a duplicate of the email for us to store in S3.

This is done by using SparkPost’s Archive feature, which gives the sender the ability to tell SparkPost to send a duplicate of the email to one or more email addresses, using the same tracking and open links as the original. SparkPost’s documentation defines the Archive feature in the following manner:

Recipients in the archive list will receive an exact replica of the message that was sent to the RCPT TO address. In particular, any encoded links intended for the RCPT TO recipient will be identical in the archive messages.

The only differences from the RCPT TO email are that some of the headers will be different, since the target address for the archive email is different, but the body of the email will be an exact replica.

If you want a deeper explanation, here is a link to the SparkPost documentation on creating duplicate (or archive) copies of an email.

As a side note, SparkPost actually allows you to send emails to cc, bcc, and archive email addresses. For this solution, we are focused on the archive addresses.

* Notice * Archived emails can ONLY be created when injecting emails into SparkPost via SMTP!
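
As a rough illustration of the SMTP requirement, the sketch below builds a message whose X-MSYS-API header asks SparkPost for an archive copy. It assumes the header carries a JSON object with an `archive` list of `{"email": ...}` entries, as I understand SparkPost's SMTP API documentation to describe; the helper name `build_archived_message` and every address shown are placeholders rather than values from this project.

```python
import json
from email.message import EmailMessage


def build_archived_message(sender, recipient, archive_address, subject, html_body):
    """Build an email whose X-MSYS-API header requests an archive copy.

    Assumption: SparkPost reads a JSON object from the X-MSYS-API header
    and sends a duplicate to every address in its "archive" list.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Placeholder archive address; it should live on your inbound domain.
    msg["X-MSYS-API"] = json.dumps({"archive": [{"email": archive_address}]})
    msg.set_content("This message requires an HTML-capable client.")
    msg.add_alternative(html_body, subtype="html")
    return msg


message = build_archived_message(
    sender="sender@example.com",
    recipient="customer@example.com",
    archive_address="archive@inbound.example.com",
    subject="Your statement",
    html_body="<p>Hello!</p>",
)
# To inject it, pass the message to smtplib.SMTP(...).send_message(message)
# with the SMTP host, port, and credentials for your SparkPost account.
```

Nothing is sent here; the point is only that the archive request travels with the message itself as a header, which is why the feature is tied to SMTP injection.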

Now that we know how to obtain a copy of the original email, we need to look at the log data that is produced and some of the subtle nuances within that data. SparkPost tracks everything that happens on its servers and offers that information up to you in the form of message-events. Those events are stored on SparkPost for 10 days and can be pulled from the server via a RESTful API called message-events, or you can have SparkPost push those events to any number of collecting applications that you wish. The push mechanism uses webhooks and works in real time.

Currently, there are 14 different events that may happen to an email. Here is a list of the current events:

Bounce
Click
Delay
Delivery
Generation Failure
Generation Rejection
Initial Open
Injection
Link Unsubscribe
List Unsubscribe
Open
Out of Band
Policy Rejection
Spam Complaint

* Follow this link for an up-to-date reference guide that describes each event along with the data that is shared for each event.

Each event has numerous fields that match the event type. Some fields like the transmission_id are found in every event, but other fields may be more event-specific; for example, only open and click events have geotag information.

One very important message event entry to this project is the transmission_id. All of the message event entries for the original email, archived email, and any cc and bcc addresses will share the same transmission_id.

There is also a common entry called the message_id that will have the same id for each entry of the original email and the archived email. Any cc or bcc addresses will have their own id for the message_id entry.
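
To make the relationship between the two ids concrete, here is a minimal sketch that groups events first by transmission_id and then by message_id. The sample data is made up, and the `{"msys": {...}}` wrapper around each event is my assumption about the webhook batch layout; only the field names transmission_id and message_id come from the discussion above.

```python
from collections import defaultdict


def unwrap(batch):
    """Yield event dicts from a webhook batch.

    Assumption: each batch item looks like {"msys": {"<event_class>": {...}}}.
    """
    for item in batch:
        for event in item.get("msys", {}).values():
            yield event


def group_events(events):
    """Group events by transmission_id, then by message_id."""
    grouped = defaultdict(lambda: defaultdict(list))
    for event in events:
        grouped[event["transmission_id"]][event["message_id"]].append(event)
    return grouped


# Made-up sample: the original and archive copies share a message_id,
# the cc copy has its own, and all three share the transmission_id.
sample_batch = [
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "customer@example.com"}}},
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M1", "rcpt_to": "archive@inbound.example.com"}}},
    {"msys": {"message_event": {"type": "delivery", "transmission_id": "T1",
                                "message_id": "M2", "rcpt_to": "cc@example.com"}}},
]

grouped = group_events(unwrap(sample_batch))
for message_id, events in grouped["T1"].items():
    print(message_id, [event["rcpt_to"] for event in events])
```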

So far this sounds great and frankly fairly easy, but now comes the challenging part. Remember, in order to get the archive email, we have SparkPost send a duplicate of the original email to another email address which corresponds to some inbox that you have access to. But in order to automate this solution and store the email body, I’m going to use another SparkPost feature called Inbound Email Relaying. What that does is take all emails sent to a specific domain and process them. By processing them, it rips the email apart and creates a JSON structure which is then delivered to an application via a webhook. See Appendix A for a sample JSON.

If you look very carefully, you will notice that the JSON structure from the inbound relay is missing a very important field: the transmission_id. All of the outbound emails carry the same transmission_id, which binds together all of the data from the original email, archive, cc, and bcc addresses, but SparkPost has no way to know that the email captured by the inbound process is connected to any of the outbound emails. The inbound process simply knows that an email was sent to a specific domain and that it should parse the email. That’s it. It will treat any email sent to that domain the same way, be it a reply from a customer or the archive email sent from SparkPost.

So the trick is: how do you glue the outbound data to the inbound process that just grabbed the archived version of the email? What I decided to do is hide a unique id in the body of the email. How this is done is up to you, but I simply created an input field with the hidden attribute turned on.

I also added that field to the metadata block of the X-MSYS-API header that is passed to SparkPost during injection. This hidden UID will end up being the glue for the whole process; it is a main component of the project and will be discussed in depth in the following blog posts.
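
One way the hidden-UID idea could look in practice is sketched below: generate an id, embed it in the HTML body as a hidden input, mirror it in the metadata block of the X-MSYS-API header, and later pull it back out of the archived body. The field name `ArchiveCode`, the metadata key `UID`, and both helper functions are hypothetical choices for this illustration, and the sketch assumes the inbound relay hands back the HTML body with the hidden input intact.

```python
import json
import uuid
from html.parser import HTMLParser

UID_FIELD = "ArchiveCode"  # hypothetical field name chosen for this sketch


def tag_email(html_body, archive_address):
    """Return (uid, html with hidden UID field, X-MSYS-API header value)."""
    uid = uuid.uuid4().hex
    hidden = f'<input type="hidden" name="{UID_FIELD}" value="{uid}">'
    header = json.dumps({
        "archive": [{"email": archive_address}],
        "metadata": {"UID": uid},  # hypothetical metadata key
    })
    return uid, html_body + hidden, header


class _UidFinder(HTMLParser):
    """Remember the value of the hidden UID input, if one is present."""

    def __init__(self):
        super().__init__()
        self.uid = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == UID_FIELD:
            self.uid = attributes.get("value")


def extract_uid(html_body):
    """Return the hidden UID from an HTML body, or None if it is missing."""
    finder = _UidFinder()
    finder.feed(html_body)
    return finder.uid


uid, tagged_html, api_header = tag_email("<p>Hello!</p>", "archive@inbound.example.com")
# The archived copy arrives through the inbound relay without a
# transmission_id, so the UID in its body is what ties it back.
assert extract_uid(tagged_html) == uid
```

Because the same UID travels in the metadata on the outbound side and in the body on the inbound side, either half can be used to look up the other.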

Now that we have the UID that will glue this project together and understand why it is necessary, I can start to build the vision of the overall project and the corresponding blog posts.

Capturing and storing the archive email along with a database entry for searching/indexing
Capturing all message event data
Creating an application to view the email and all corresponding data
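
As a loose sketch of how those three pieces might cross-reference each other, the example below keeps one row per archived email (pointing at its S3 object) and one row per event, joined on the UID. The table names, column names, and S3 key are invented for illustration, and Python's built-in sqlite3 stands in for MySQL purely so the example runs without a database server; it is not the schema used in the repository.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE archived_email (
        uid     TEXT PRIMARY KEY,   -- hidden UID found in the archived body
        s3_key  TEXT NOT NULL,      -- where the body was stored in S3
        subject TEXT
    );
    CREATE TABLE message_event (
        id              INTEGER PRIMARY KEY,
        uid             TEXT NOT NULL,   -- same UID, assumed to come from metadata
        transmission_id TEXT NOT NULL,
        type            TEXT NOT NULL,
        rcpt_to         TEXT
    );
""")

conn.execute(
    "INSERT INTO archived_email VALUES (?, ?, ?)",
    ("abc123", "archive/abc123.eml", "Your statement"),
)
conn.executemany(
    "INSERT INTO message_event (uid, transmission_id, type, rcpt_to) "
    "VALUES (?, ?, ?, ?)",
    [
        ("abc123", "T1", "delivery", "customer@example.com"),
        ("abc123", "T1", "open", "customer@example.com"),
    ],
)

# Given a UID, find the stored body and every event recorded for it.
rows = conn.execute(
    """
    SELECT a.s3_key, e.type
    FROM archived_email AS a
    JOIN message_event AS e ON e.uid = a.uid
    WHERE a.uid = ?
    ORDER BY e.id
    """,
    ("abc123",),
).fetchall()
print(rows)
```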

Here is a simple diagram of the project:

[Diagram: build an email archiving system]

The first drop of code will cover the archive process and storing the email onto S3, while the second code drop will cover storing all of the log data from message-events into MySQL. You can expect the first two code drops and blog entries sometime in early 2019. If you have any questions or suggestions, please feel free to pass them along.

Happy Sending.

- Jeff