javascript - How to parse an email signature to get the details separately?


I have a requirement in my project to parse the signatures of mails from a Gmail account. From the signature, I have to fetch the first name, last name, mail ID, etc. [only the sender's]. Can anyone please let me know where to start from? ("Where to start from" in the sense of: is there anything in place already?)

I have gone through this question, but that question is about removing the signature, which is the opposite of my requirement. The answer there does not solve my problem.

I know I can use regex to get this done. But I don't want to miss mails that do not follow the netiquette of mail signatures, for example by removing the "--" before the signature or the trailing hyphens.

And if possible, please let me know of any open source JavaScript projects that provide this functionality.

Thanks in advance.

Update: The signatures I am looking at are business related and contain HTML content or vCards directly.

Update: I need to strip each line of the signature and get the details from these lines.

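There are several potential parts to answering this question.

Signatures within the Gmail interface

Within the Gmail interface, signatures are easy to grab. They are wrapped in <font color="#888888">, so getting them with an XML reader should be pretty easy, if you're getting signatures from within the Gmail interface. This won't find signatures that Gmail doesn't detect.

Signatures in messages sent from Gmail using the signature setting

Just look for <div class=3D"gmail_signature"> in the HTML version of the email. (The =3D is the quoted-printable encoding of "="; once the part is decoded, the attribute reads class="gmail_signature".)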
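As a rough illustration of that lookup, here is a minimal sketch in plain JavaScript. The function names `decodeQuotedPrintable` and `extractGmailSignature` are made up for this example, the decoder only handles single-byte content, and matching tags with a regex is an assumption that holds for simple markup only. A real HTML parser would be more robust.

```javascript
// Hypothetical helper: decode the quoted-printable subset needed here
// (soft line breaks and =XX escapes). Assumes single-byte characters;
// multi-byte UTF-8 content would need a real decoder.
function decodeQuotedPrintable(raw) {
  return raw
    .replace(/=\r?\n/g, "")
    .replace(/=([0-9A-Fa-f]{2})/g, (_, hex) =>
      String.fromCharCode(parseInt(hex, 16))
    );
}

// Hypothetical helper: return the inner HTML of the first
// <div class="gmail_signature" ...>, or null if there is none.
// Nested <div> elements are handled by counting open and close tags.
function extractGmailSignature(html) {
  const open =
    /<div\b[^>]*class="[^"]*\bgmail_signature\b[^"]*"[^>]*>/i.exec(html);
  if (!open) return null;

  const start = open.index + open[0].length;
  const divTag = /<(\/?)div\b[^>]*>/gi;
  divTag.lastIndex = start;

  let depth = 1;
  let match;
  while ((match = divTag.exec(html)) !== null) {
    depth += match[1] ? -1 : 1; // closing tag lowers depth, opening raises it
    if (depth === 0) return html.slice(start, match.index);
  }
  return null; // unbalanced markup
}

// Usage with a made-up message body:
const rawBody =
  '<div>Hi</div><div class=3D"gmail_signature">' +
  "<div>Jane Doe</div><div>jane@example.com</div></div>";
console.log(extractGmailSignature(decodeQuotedPrintable(rawBody)));
```

The result is still HTML, so the tags have to be stripped (keeping line breaks) before the line-based steps further down apply.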

A general method of signature parsing

I am arbitrarily limiting the target to the contact information of the sender. As such, it makes sense to look for contact information only in the signature. Many emails contain contact information for people other than the sender, so the first step is to isolate the signature.

Once the signature is isolated, each line can be matched against regex patterns. I am by no means a regex expert, so I won't attempt to describe the actual patterns here.

What follows is a method, not code. The actual implementation should be pretty straightforward.

Grabbing signatures from the email

  1. Remove everything except the rendered text in the target message. Leave \n newlines in the proper places.
  2. Work up from the bottom of the message, storing each line in a variable. Stop when you hit a long line (60+ characters; the exact number needs experimentation [1]). Don't include the long line.
  3. If there are a number of \n in the middle somewhere, remove them and everything above them. This should remove short lines and closing salutations. [2]

Now the signature is isolated.
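The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: the function name `isolateSignature`, the default of 60 characters, and the choice of two or more consecutive blank lines as "a number of \n" are all assumptions for illustration. The input is assumed to be rendered plain text already (step 1).

```javascript
// Hypothetical helper: isolate a probable signature block from plain text.
// maxLineLength and minBlankRun are tunable guesses, not fixed rules.
function isolateSignature(text, maxLineLength = 60, minBlankRun = 2) {
  const lines = text
    .replace(/\r\n/g, "\n")
    .split("\n")
    .map((line) => line.trim());

  // Ignore blank lines at the very end of the message.
  while (lines.length > 0 && lines[lines.length - 1] === "") lines.pop();

  // Step 2: walk up from the bottom until a long line is hit (excluded).
  let start = lines.length;
  while (start > 0 && lines[start - 1].length < maxLineLength) start--;
  let candidate = lines.slice(start);

  // Step 3: find the bottom-most run of blank lines that has text above it,
  // then drop the run and everything above it.
  let run = 0;
  for (let i = candidate.length - 1; i >= 0; i--) {
    if (candidate[i] === "") {
      run++;
    } else {
      if (run >= minBlankRun) {
        candidate = candidate.slice(i + 1 + run);
        break;
      }
      run = 0;
    }
  }

  return candidate.filter((line) => line !== "");
}

// Usage with a made-up message:
const message = [
  "Hello team, here is the quarterly report that we discussed during the meeting last week.",
  "Let me know.",
  "Thanks,",
  "",
  "",
  "Jane Doe",
  "555-123-4567 | jane@example.com",
].join("\n");
console.log(isolateSignature(message));
// → [ 'Jane Doe', '555-123-4567 | jane@example.com' ]
```

If no qualifying run of blank lines is found, the sketch keeps every short line it collected, so closing salutations may survive and have to be filtered later.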

Here are my assumptions about the parts remaining. Unless an order is specified, assume they can be in any order.

a) End of the message and closing greeting in the topmost line(s)
b) Name
c) Phone number
d) Email address
e) Mailing address
f) Tag line or witty saying, etc.

[1] The 60-character line length is based on the fact that RFC 2822 suggests that lines should be no more than 78 characters long. Gmail respects this. Most signature lines are shorter than that, unless a whole address is written on a single line. Signatures on extremely short emails (< 20 words) will not be detected by this method, but it is trivial to first check the total message length and use different code to deal with that.

[2] As signatures are often automatically added, there is often a series of newlines before them. However, hand-typed signatures may not follow this pattern, so depending on the type of emails you're processing, you may find this step unhelpful or detrimental.

Identifying the parts of the signature

Now that you have reduced the likelihood of false positive matches for your regex, you can see if the remaining lines match any of your patterns.

  1. Replace common dividers with newlines; | is a common example.
  2. Check if any of the lines match your regex patterns. If they do, remove them from further consideration. The hardest part is differentiating names from other things. Suggested order:

    Email

    Phone

    Zip code (then address, if you find a zip code)

What is left should be the closing salutation, name, tag line, and malformed parts of the items above. Be aware that while regex is often used to find errors (for validation), here you want to match even the errors, remove those lines from further processing, and only then validate or normalize.
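To make the matching order concrete, here is a minimal sketch. The patterns are deliberately loose and are assumptions for illustration only (the phone and zip patterns are US-centric, and a ZIP+4 code would be picked up by the phone pattern first). `classifySignatureLines` is a made-up name, and its input is assumed to be the list of isolated signature lines.

```javascript
// Illustrative patterns, tried in the suggested order: email, phone, zip.
// They are loose on purpose so that slightly malformed values still match.
const PATTERNS = [
  ["email", /[^\s<>()|,;:]+@[^\s<>()|,;:]+\.[A-Za-z]{2,}/],
  ["phone", /(?:\+?\d[\s().-]*){7,15}/],
  ["zip", /\b\d{5}(?:-\d{4})?\b/],
];

// Hypothetical helper: split lines on common dividers, then sort each part
// into the first pattern it matches. Unmatched parts are returned separately.
function classifySignatureLines(signatureLines) {
  const found = { email: [], phone: [], zip: [] };
  const remaining = [];

  const parts = signatureLines
    .flatMap((line) => line.split(/\s*[|•]\s*/))
    .map((part) => part.trim())
    .filter((part) => part !== "");

  for (const part of parts) {
    const hit = PATTERNS.find(([, pattern]) => pattern.test(part));
    if (hit) {
      found[hit[0]].push(part);
    } else {
      remaining.push(part);
    }
  }
  return { found, remaining };
}

// Usage with made-up signature lines:
const result = classifySignatureLines([
  "Sincerely,",
  "Jane Doe",
  "555-123-4567 | jane@example.com",
  "Springfield, IL 62704",
]);
console.log(result.found);
console.log(result.remaining); // → [ 'Sincerely,', 'Jane Doe' ]
```

Validation or normalization of the matched values would happen afterwards, on the contents of `found`.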

In my view, the hardest part of figuring out each part is distinguishing names from tag lines. Here are some suggestions that should cover the common cases:

  1. Names consist of a small number of words.
  2. Names contain periods only in certain places - after 1-3 letter words. (French has M. for Monsieur)
  3. Names mostly don't contain punctuation. The exceptions are dashes and apostrophes, in addition to the periods above. You might run into issues with commas before titles, for example, John Lawyer, Esq.
  4. Tag lines may end with a comma.
  5. Capitalization can hint at (but not definitively say) whether something is a name.

Further, you can blacklist common closing salutation words (Sincerely, Thank(s), Cheers, etc.). If that narrows it down to 1 or 2 lines, the upper one is probably the name and the lower one the tag line.

For more information on identifying names, see "Find names with regular expression". Remember that while it should be easy to write a solution for the general case, natural language processing is a huge field and beyond the scope of mere mortals like me. Named entity recognition is a known challenge. Hopefully, I've described what to do in most cases.
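A few of these hints can be combined into a rough check like the one below. It is only a heuristic sketch: the name `looksLikeName`, the four-word limit, the closing-word list, and the requirement that the first word be capitalized are all assumptions chosen for illustration. It intentionally does not handle the "John Lawyer, Esq." case.

```javascript
// Assumed list of closing salutation words to exclude; extend as needed.
const CLOSING_WORDS = /^(?:sincerely|regards|best|thanks?|cheers)\b/i;

// Hypothetical helper: guess whether a leftover signature line is a name.
function looksLikeName(line) {
  const text = line.trim();
  if (text === "" || CLOSING_WORDS.test(text)) return false;

  // Hint 1: a small number of words (the limit of 4 is a guess).
  const words = text.split(/\s+/);
  if (words.length > 4) return false;

  // Hints 2 and 3: letters, dashes and apostrophes only, with a period
  // allowed only after a 1-3 letter word such as "M." or "Dr.".
  const plainWord = /^\p{L}[\p{L}'’-]*$/u;
  const abbreviation = /^\p{L}{1,3}\.$/u;
  if (!words.every((w) => plainWord.test(w) || abbreviation.test(w))) {
    return false;
  }

  // Hint 5: capitalization is only a hint, so just the first word is checked.
  return /^\p{Lu}/u.test(words[0]);
}

// Usage with made-up leftover lines:
const leftovers = ["Sincerely,", "Jane Doe", "Think before you print."];
console.log(leftovers.filter(looksLikeName)); // → [ 'Jane Doe' ]
```

Because every rule here has exceptions, it is probably safer to treat the result as a ranking signal than as a final answer.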

