How to extract an email?


#1

I want to ‘extract’ an email from a bunch of text, for example, from:

•3/23/17 @1:59 PM - Doug & Kristin off to NY by way of Utah etc for a month; gone six months?
kristin_douglas@yahoo.com

is there a way to grab the email and not include the ‘•3/23/17 @1:59’ ?
Thanks.


#2

If this is a consistent format, you could just use a text funnel to get everything from the hyphen onward.

["-",-1] 

or

["-",-1][2,-1]

if you don’t want to include the hyphen.


#3

The example for extracting email addresses given in the Help file for regexarray( function will work on your text string:

regexarray(TEXT,"([\w!#$%&'*+/=?`{|}~^.-]+)@([A-Z0-9.-]+\.[A-Z]{2,3})",", ")

Using this on your text string:

regexarray("•3/23/17 @1:59 PM - Doug & Kristin off to NY by way of 
Utah etc for a month; gone six months? kristin_douglas@yahoo.com",
"([\w!#$%&'*+/=?`{|}~^.-]+)@([A-Z0-9.-]+\.[A-Z]{2,3})",", ")

Returns: kristin_douglas@yahoo.com

Hold it, I see after rereading that you probably wanted the entire text after the initial section and not the email address at the end. I default to Dave’s solution.


#4

Thanks for the help. I did want just the email address.
I’ll give the regexarray a shot…my first regex!

Also am wondering if isolation of the (first) email in a bunch of text can be done in Pan6?


#5

You could use the onewhitespace( function on the text, and then do an ArrayFilter with space as the separator. The formula would test for

import() match "*?@?*.?*"

that would ensure that there was at least one non-white space character on either side of the @, and that it included a dot somewhere after the @, with no intervening spaces.


#6

That would work in the original example data, but would not work if the email had any punctuation immediately before or after it, like a comma, period, parenthesese, etc.) This is a very difficult problem to solve without regular expressions (and even then, it may not be perfect).


#7

Gary’s example works for me in Pan X. Dave’s example (on whidh Jim has commented), did not work.
I used:
/* Monday, December 4, 2017

Also am wondering if isolation of the (first) email in a bunch of text can be done in Pan6?
You could use the onewhitespace( function on the text, and then do an ArrayFilter with space as the separator. The formula would test for

import() match "?@?.?*"
that would ensure that there was at least one non-white space character on either side of the @, and that it included a dot somewhere after the @, with no intervening spaces.
*/

local x,lEmail
x=Comment //which contains an email address
lEmail=onewhitespace(x)

arrayfilter x,lEmail," ",import() match “?@?.?*”

clipboard()= "lEmail is "+str(lEmail)
clipboard()=“lEmail is 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0”


#8

This is more like what I had in mind.

local x,lEmail
x=Comment //which contains an email address
lEmail=onewhitespace(x)

arrayfilter lEmail,lEmail," ",?(import() match "?@?.?*",import(),"")
clipboard=array(arraystrip(lEmail," "),1," ")

Your version is returning -1 when it thinks it’s found an email address, and 0 when it’s not an email address.


#9

Dave, should’n the match be more like:

import() match "?*@?*.?*"

#10

Or easier: textafter("•3/23/17 @1:59 PM - Doug & Kristin off to NY by way of Utah etc for a month; gone six months?
kristin_douglas@yahoo.com","-",1) (or "= " if you do not want the initial space.)

If you want to extract the email address, you can use a text funnel to throw out all the text before the final space:
•3/23/17 @1:59 PM - Doug & Kristin off to NY by way of Utah etc for a month; gone six months?
kristin_douglas@yahoo.com["- ",-1].

Or more simply, you can use lastword("•3/23/17 @1:59 PM - Doug & Kristin off to NY by way of Utah etc for a month; gone six months?
kristin_douglas@yahoo.com")

These functions are in both 6 and X. There are an abundance of methods for doing this.


#11

In Craig’s initial example the email address happened to be at the end of the text, but I’m pretty sure he wanted to extract the email address no matter where it was in the text, and if there are multiple email addresses, he wanted to extract all of them. So any solution that assumes a fixed location is not what Craig was looking for. Dave’s arrayfilter idea will work in Panorama 6, but only if the email address has white space around it, not any punctuation. The regular expression solution will work even if there is punctuation before or after, and will reject most invalid email addresses, for example it knows that joe@123 is not a valid email address even though it kind of looks like one. However, of course, regular expressions can only be used in Panorama X.


#12

The biggest problem is that the question is not stated precisely. Often that is the biggest problem in database design.


#13

Yes it should. I was editing Craig’s version, and I didn’t notice that he had dropped an asterisk from the formula I had suggested.

Oh. He didn’t drop the asterisks. The forum did, when he quoted. If it’s not formatted as code, the asterisks need to be escaped.

And that’s what you just said.


#14

Well, it looks like the asterisks are dropped when quoting as show when you quoted my post above.


#15

Here is a version of Dave’s Panorama 6 scheme on steroids. It should get the lion share of addresses including those adjacent to punctuation marks. There will obviously be special situations where it will not work like if there is only a punctuation between the address and additional text without any existing white space. Anyway, here it is:

local lEmail
lEmail=TheField //field containing text in question
    //allow only characters used in email addresses plus space & return
lEmail=stripchar(lEmail,
    "@@..  !!##$$%%&&''**++--//==??^^__``{{||}}~~;;09azAZ"+¶+¶)
    //replace spaces with returns
lEmail=replace(lEmail," ",¶)
    //extract text that match an email address & remove trailing period
arrayfilter lEmail,lEmail,¶,?(import() match 
    "?*@?*.?*",import()[1,"≠-."],"")
   //strip extra returns
lEmail=arraystrip(lEmail,¶)
    //outputs a return separated list of included email addresses
message lEmail