Channel: Active questions tagged email - Stack Overflow

Any smart way or package in R for extracting mail bodies from emails?


I have a data frame with one column of emails. A single value in this column can be an entire email chain. For example, one of them looks like this as HTML:

 <p>Dear aaa,</p> \r\n
  <p>&nbsp;</p> \r\n
  <p>this is mail body from mail 1. <br /><br /><br />Regards </p>\r\n
  <p> </p>\r\n
  <p> </p>\r\n
  <div></div>\r\n
  <div>
   <br />
   <br />
   <br />
  </div>\r\n
  <div align="left">
   \r\n
   <hr />\r\n
   <font face="Tahoma" size="1"><b>From:</b> aaa <br /><b>Sent:</b> 5/12/2007 01:31:52 PM GST (GMT +04:00)<br /><b>To:</b> bbb <b>:</b> ccc, ddd, eee <br /><b>Mail Number:</b> 111 <br /><b>Subject:</b> xxxxxxx </font>
   <br />
   <br />
   <br />
  </div>\r\n
  <p>aaa, </p>\r\n
  <p>&nbsp;</p>\r\n
  <p>this is mail body from mail 2</p>\r\n
  <p>&nbsp;</p>\r\n
  <p>Regards, </p>\r\n
  <p>&nbsp;</p>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <strong><span style="\&quot;FONT-SIZE:" 14pt;="" color:="" #3366ff;="" font-family:="" 'viner="" hand="" itc'\"="">aaa</span></strong>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <b><span style="\&quot;FONT-SIZE:" 10pt;="" color:="" #17365d;="" font-family:="" verdana\"="">Logistics Manager</span></b>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <b><span style="\&quot;FONT-SIZE:" 10pt;="" color:="" #17365d;="" font-family:="" verdana\"="">Project Services</span></b>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   &nbsp;
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <b><span style="\&quot;COLOR:" gray;="" font-family:="" verdana\"="">aaa</span></b>
   <b><span style="\&quot;COLOR:" #999999;="" font-family:="" verdana\"=""> bbb </span></b>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">P.O. Box: 111</span>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">The Project</span>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">country name</span>
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">, country name</span>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   &nbsp;
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">Mobile</span>
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">&nbsp;&nbsp;&nbsp;: +1234 5678</span>
  </div>\r\n
  <div style="\&quot;MARGIN:" 0in="" 0pt\"="">
   <span style="\&quot;FONT-SIZE:" 8pt;="" color:="" #17365d;="" font-family:="" verdana\"="">Email&nbsp;&nbsp;&nbsp; : &nbsp;<a href="\&quot;mailto:aaa@xyz\&quot;" _fcksavedurl="\&quot;mailto:aaa@xyz\&quot;">aaa@xyz</a></span>
  </div>

In my data, this is stored as a single string:

a <- '<P>Dear aaa,</P>\r\n<P>&nbsp;</P>\r\n<P>this is mail body from mail 1.<BR><BR><BR>Regards </P>\r\n<P> </P>\r\n<P> </P>\r\n<DIV></DIV>\r\n<DIV><BR><BR><BR></DIV>\r\n<DIV align=left>\r\n<HR>\r\n<FONT face=Tahoma size=1><B>From:</B> aaa <BR><B>Sent:</B> 5/12/2007 01:31:52 PM GST (GMT +04:00)<BR><B>To:</B> bbb <B>:</B> ccc, ddd, eee <BR><B>Mail Number:</B> 111 <BR><B>Subject:</B> xxxxxxx </FONT><BR><BR><BR></DIV>\r\n<P>aaa, </P>\r\n<P>&nbsp;</P>\r\n<P>this is mail body from mail 2</P>\r\n<P>&nbsp;</P>\r\n<P>Regards, </P>\r\n<P>&nbsp;</P>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><STRONG><SPAN style=\"FONT-SIZE: 14pt; COLOR: #3366ff; FONT-FAMILY: 'Viner Hand ITC'\">aaa</SPAN></STRONG></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><B><SPAN style=\"FONT-SIZE: 10pt; COLOR: #17365d; FONT-FAMILY: Verdana\">Logistics Manager</SPAN></B></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><B><SPAN style=\"FONT-SIZE: 10pt; COLOR: #17365d; FONT-FAMILY: Verdana\">Project Services</SPAN></B></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\">&nbsp;</DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><B><SPAN style=\"COLOR: gray; FONT-FAMILY: Verdana\">aaa</SPAN></B><B><SPAN style=\"COLOR: #999999; FONT-FAMILY: Verdana\"> bbb </SPAN></B></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">P.O. 
Box: 111</SPAN></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">The Project</SPAN></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">country name</SPAN><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">, country name</SPAN></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\">&nbsp;</DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">Mobile</SPAN><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">&nbsp;&nbsp;&nbsp;: +1234 5678</SPAN></DIV>\r\n<DIV style=\"MARGIN: 0in 0in 0pt\"><SPAN style=\"FONT-SIZE: 8pt; COLOR: #17365d; FONT-FAMILY: Verdana\">Email&nbsp;&nbsp;&nbsp; : &nbsp;<A href=\"mailto:aaa@xyz\" _fcksavedurl=\"mailto:aaa@xyz\">aaa@xyz</A></SPAN></DIV>'

This example contains 2 mails, and I have many more emails in a similar format. I only need to extract the mail bodies. In this case, these are 'this is mail body from mail 1.' and 'this is mail body from mail 2'. Deleting all the unnecessary parts with stringr is really hard.

Is there an existing package for this that I am not aware of?
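
To make the goal concrete, here is a minimal base R sketch of the extraction described above. It is not an existing package. It rests on assumptions taken only from the sample: each quoted mail starts after an <HR> rule, body text sits in <P> paragraphs while the From/Sent header and the signature sit in <DIV> blocks, and greeting or sign-off lines either end in a comma or start with "Dear" or "Regards". The function name extract_bodies and the shortened sample_chain string are hypothetical.

```r
# Hypothetical helper: return one body string per mail in an HTML email chain.
extract_bodies <- function(html) {
  # Assumption: an <HR> rule separates one mail from the next quoted mail.
  mails <- strsplit(html, "(?i)<hr[^>]*>", perl = TRUE)[[1]]

  vapply(mails, function(m) {
    # Keep only <P>...</P> paragraphs; in the sample, headers and
    # signatures are inside <DIV> blocks and are skipped this way.
    p <- regmatches(m, gregexpr("(?is)<p\\b[^>]*>.*?</p>", m, perl = TRUE))[[1]]

    # Turn <BR> into line breaks, then drop remaining tags and &nbsp;.
    p <- gsub("(?i)<br[^>]*>", "\n", p, perl = TRUE)
    p <- gsub("<[^>]+>", "", p)
    p <- gsub("&nbsp;", " ", p, fixed = TRUE)

    lines <- trimws(unlist(strsplit(p, "\n", fixed = TRUE)))

    # Heuristic (assumption): drop empty lines, lines ending in a comma
    # ("Dear aaa,", "Regards,") and lines starting with Dear/Regards.
    keep <- nzchar(lines) &
      !grepl(",$", lines) &
      !grepl("^(dear|regards)\\b", lines, ignore.case = TRUE, perl = TRUE)

    paste(lines[keep], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Shortened stand-in for the string `a` shown above.
sample_chain <- paste0(
  "<P>Dear aaa,</P>\r\n<P>&nbsp;</P>\r\n",
  "<P>this is mail body from mail 1.<BR><BR><BR>Regards </P>\r\n",
  "<DIV align=left>\r\n<HR>\r\n",
  "<FONT face=Tahoma size=1><B>From:</B> aaa </FONT></DIV>\r\n",
  "<P>aaa, </P>\r\n<P>this is mail body from mail 2</P>\r\n",
  "<P>Regards, </P>\r\n<DIV><B>Logistics Manager</B></DIV>"
)

bodies <- extract_bodies(sample_chain)
print(bodies)
```

For a whole data frame column, something like vapply or lapply over the column would apply the same helper to each chain. The comma rule is deliberately crude and would also drop a real body line that happens to end in a comma, so it would need tuning against actual mails. A proper HTML parser such as the xml2 package is likely more robust than these regular expressions for messier markup.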

