Re: Parsing (scraping) OpenGraph Tags from html HEAD
2017-08-02 17:54 GMT+02:00 Sannyasin Brahmanathaswami via use-livecode <use-livecode@lists.runrev.com>:

> Responding on top
>
> Jacque's method only gets us a list, not an array, so one ends up having
> to write more code to parse the list anyway; your method is more efficient.
>
> "not comfortable with RegEx" Ha, right, but it is worth the effort to keep
> the little grey cells green! I will have to study the regEx… things like ?ms
> are "brand new" to me.

So, you win your first Regex training :)

(?ms) are regex options:
m means multi-line: ^ and $ also match at the start and end of each line.
s means the dot ( '.' ) can also match a return/cr/lf char.

> re: extracting the head first: I was under the impression your repeat loop
> would have to work through the entire text of _HTML unnecessarily and that
> extracting the head would reduce processing time.

Well, you are right, but only when the regex tries to match after the last valid pattern.
What is most costly is the delete inside the loop, so working only with the <head>...</head> of your html might be more efficient in this case. But this is more a LC thing.

> OTOH, Andre tells me that for this kind of operation, even cell phones
> have CPUs that are more powerful than some desktop machines, so perhaps
> the time to loop through the entire html source is too trivial to consider
> at all.

Yep, as I said, only after the last match will the regex scan through to the end of the html, and only one time.
About quality concerns, restricting the regex to the <head> part is a good idea, as you never know what some html could contain in the future...

> Thanks for the effort you put into this.

You're welcome.

Kind regards,

Thierry

> We are adding OG tags to all the media on our web site (eventually) and our
> apps will need to parse that out in various contexts.
>
> BR
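As a rough illustration of the two options explained above, here is a minimal sketch. The handler name demoRegexFlags and the sample text are made up for this example; matchText is the standard LiveCode function, and the (?s) and (?m) prefixes are the inline PCRE options discussed in the message.

```livecode
on demoRegexFlags
   local tText, tInside, tWord, tReport
   -- hypothetical sample: a title that is split across two lines
   put "<title>Kauai's" & return & "Hindu Monastery</title>" into tText

   -- (?s): the dot may also match the line break, so (.+?) can span both lines
   put "with (?s):" && matchText(tText, "(?s)<title>(.+?)</title>", tInside) & return after tReport
   -- the same pattern without (?s), for comparison
   put "without (?s):" && matchText(tText, "<title>(.+?)</title>") & return after tReport
   -- (?m): ^ may match at the start of any line, not only at the start of the text
   put "with (?m):" && matchText(tText, "(?m)^(Hindu)", tWord) & return after tReport

   -- show the three results, then whatever the (?s) pattern captured
   put tReport & tInside
end demoRegexFlags
```

Running the handler from the message box and comparing the three reported values is probably the quickest way to see what each option changes.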
Re: Parsing (scraping) OpenGraph Tags from html HEAD
Responding on top

Jacque's method only gets us a list, not an array, so one ends up having to write more code to parse the list anyway; your method is more efficient.

"not comfortable with RegEx" Ha, right, but it is worth the effort to keep the little grey cells green! I will have to study the regEx… things like ?ms are "brand new" to me.

re: extracting the head first: I was under the impression your repeat loop would have to work through the entire text of _HTML unnecessarily and that extracting the head would reduce processing time.

OTOH, Andre tells me that for this kind of operation, even cell phones have CPUs that are more powerful than some desktop machines, so perhaps the time to loop through the entire html source is too trivial to consider at all.

Thanks for the effort you put into this. We are adding OG tags to all the media on our web site (eventually) and our apps will need to parse that out in various contexts.

BR

On 8/1/17, 10:07 PM, "use-livecode on behalf of Thierry Douez via use-livecode" wrote:

> [snip]

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode
Re: Parsing (scraping) OpenGraph Tags from html HEAD
2017-08-02 6:45 GMT+02:00 Sannyasin Brahmanathaswami:

Hi Brahmanathaswami,

> Thanks Thierry
>
> though I'm not yet sure that using regEx is better than using Jacque's
> method

That's 2 different ways..
but with the regex one, you have the exact key and value of each tag, nothing more to do.

> Either way it would seem prudent to extract the head first before processing

Mmm, don't really see why, but I've added a line of code for this too below.

> Using Jacque's method just gets the list..
> and we need to do more coding to get the array we need.
>
> But your method can only handle 1 tag.

I was aware of that but didn't know what you wanted to achieve, therefore I left it for the reader.
However, this has nothing to do with the regex but with the code inside the repeat loop.

Here is another way to do it, changing only *1* line of code inside the loop, with the same regex as before:

   -- to please BR wishes, but not necessary
   -- erase everything after </head>
   put replaceText( _Html, "(?ms)</head>.*?$", empty) into _Html

   repeat while matchChunk( _Html, Rx, p1,p2,p3,p4 )
      put char p1 to p2 of _Html & tab & char p3 to p4 of _Html & cr after Rslt
      delete char 1 to p4 of _Html
   end repeat
   delete last char of Rslt -- extra cr

   put Rslt into fld 1
   answer "Got " & the number of lines of Rslt & " og: meta tags!"

To build a multi-dimensional array after the extraction, a bit more work inside the repeat loop will be needed, but the extraction part is still valid.

Finally, if you are not at ease with regex, go with Jacque's way and everything will be fine.
There is fundamentally not much difference between the 2 ways.
Kind regards,

Thierry
--
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage
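The "bit more work inside the repeat loop" could look something like the sketch below. It is only one possible reading, not Thierry's code: the function name ogTagsToArray, the counter array and the regex are assumptions made for this example (the pattern expects property before content, both in double quotes, written \x{22}).

```livecode
function ogTagsToArray pHtml
   local tRx, tKey, tValue, tTags, tCount
   local tKeyStart, tKeyEnd, tValStart, tValEnd
   -- ASSUMPTION: <meta property="og:KEY" content="VALUE" ...>, in that order
   put "(?is)<meta\s+property=\x{22}og:([^\x{22}]+)\x{22}\s+content=\x{22}([^\x{22}]+)\x{22}" into tRx
   repeat while matchChunk(pHtml, tRx, tKeyStart, tKeyEnd, tValStart, tValEnd)
      put char tKeyStart to tKeyEnd of pHtml into tKey
      put char tValStart to tValEnd of pHtml into tValue
      -- number the values per key, so a repeated key keeps all its values in order
      add 1 to tCount[tKey]
      put tValue into tTags[tKey][tCount[tKey]]
      -- same technique as in the loop above: drop what has been consumed
      delete char 1 to tValEnd of pHtml
   end repeat
   return tTags
end ogTagsToArray
```

With this layout a single-valued key would be read as tTags["title"][1], and the repeated ones as tTags["video:tag"][1], tTags["video:tag"][2] and so on. Splitting "video:tag" further into nested keys is left out to keep the sketch short.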
Re: Parsing (scraping) OpenGraph Tags from html HEAD
Thanks Thierry

though I'm not yet sure that using regEx is better than using Jacque's method

   on parseHeader pData
      set the lineDel to "",l)-1 of l & cr after tList
      end repeat
      -- do something with tList
   end parseHeader

Either way it would seem prudent to extract the head first before processing

   put the htmlText of widget "youtubes" into _HTML
   # interesting convention of underscore usage for var declaration
   put char ( offset("<head>",_HTML)) to ( ( offset("</head>",_HTML))+6) of _html into tHead

Using Jacque's method just gets the list, and we need to do more coding to get the array we need. It returns:

   "og:site_name" content="YouTube"
   "og:url" content="https://www.youtube.com/user/kauaiaadheenam"
   "og:title" content="Kauai's Hindu Monastery"
   "og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg"
   "og:description" content="{where hinduism meets the future}"
   "og:type" content="profile"
   "og:video:tag" content="kauai"
   "og:video:tag" content="hawaii"
   "og:video:tag" content="hindu"
   "og:video:tag" content="hinduism"
   "og:video:tag" content="siva"
   # And many more tags, a total of 39 tags…

But your method can only handle 1 tag:

   description:{where hinduism meets the future}
   image:https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg
   site_name:YouTube
   title:Kauai's Hindu Monastery
   type:profile
   url:https://www.youtube.com/user/kauaiaadheenam
   video:tag:scriptural

   # the rest of the tags, all 38 preceding ones, are lost -- "scriptural" was the last one
   # and so stands as the final output for the key, as the loop is
   # effectively retaining the single key "og:video:tag" and replacing the value 39 times,
   # leaving us with only the last value, that of the 39th tag.
   # so we would need an ordered multi-dimensional array like OG["site_name"]
   # and the other top keys, then:
   OG["video"]["tags"][1]
   OG["video"]["tags"][2]

But I'm not sure we need tags for the particular use case in question, which is to create a robust "history" of web viewing with more detail. OTOH, since we are coding for "Oh God" data, we may as well get all the tags into the array. It could be useful later to have this code in the toolbox for when we *do* want all the tags from the OG set… God does not like to see partial metadata, because S/He Knows All the Metadata.

BR

On 7/31/17, 12:31 AM, "use-livecode on behalf of Thierry Douez via use-livecode" wrote:

> [snip]
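The "extract the head first" step from this message can be wrapped in a small function. This is a minimal sketch under a few assumptions: the name headOf is made up, "<head" is searched without its closing bracket so that a head tag carrying attributes is still found, and the full text is returned when either tag is missing.

```livecode
function headOf pHtml
   local tStart, tEnd
   put offset("<head", pHtml) into tStart
   put offset("</head>", pHtml) into tEnd
   -- ASSUMPTION: if the head cannot be located, work on the whole text instead
   if tStart is 0 or tEnd is 0 or tEnd < tStart then return pHtml
   -- "</head>" is 7 chars long, so its last char sits at tEnd + 6
   return char tStart to (tEnd + 6) of pHtml
end headOf
```

It would be called before any parsing, for example `put headOf(_HTML) into tHead`.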
Re: Parsing (scraping) OpenGraph Tags from html HEAD
On 07/29/2017 01:16 PM, Sannyasin Brahmanathaswami via use-livecode wrote:

LOL. I guess Brahmanathaswami's been around these parts long enough by now to have OG status.

--
Mark Wieder
ahsoftw...@gmail.com
Re: Parsing (scraping) OpenGraph Tags from html HEAD
2017-07-29 22:16 GMT+02:00 Sannyasin Brahmanathaswami:

> you want to extract from the <head> of the document the openGraph tags
>
> <meta property="og:site_name" content="YouTube">
> <meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
> <meta property="og:title" content="Kauai's Hindu Monastery">
> <meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg">
> <meta property="og:description" content="{where hinduism meets the future}">
>
> c) you also cannot depend on the output being line delimited, because some
> CMS's delivery "agents" will minimize this to
>
> <meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">
>
> Has anyone rolled up a parser/scraper for this? Looks like "idiot simple text extraction"

Hi,

Here is a quickly coded piece of code, tested only on your URL.
I wrote this regex based on the data you provided in your email.

> I see the other thread on scraping pages generated by JS and suspect
> perhaps some wizard among us already has this done… would save a bit of time
> here.
>
> BR

Every time you see any kind of scraping/search/extraction/transformation in JS, you can be sure it's possible to do it in LiveCode.

So, here is the code:

   local Rx, Rslt, _Html, OG

   put empty into Rslt
   put URL "https://www.youtube.com/user/kauaiaadheenam" into _Html

   get "(?ms)"
   put IT into Rx

   repeat while matchChunk( _Html, Rx,p1,p2,p3,p4 )
      put char p3 to p4 of _Html into OG[ char p1 to p2 of _Html ]
      delete char 1 to p4 of _Html
   end repeat

and you can test it this way:

   combine OG using return and ":"
   put OG into fld 1

HTH and feel free to ask any question...

Kind regards,

Thierry
--
Thierry Douez - sunny-tdz.com
sunnYrex - sunnYtext2speech - sunnYperl - sunnYmidi - sunnYmage
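The regex in the message above did not survive the list archive intact. The sketch below shows a pattern of the kind the loop needs, with two capture groups (key and value) and \x{22} standing for the double quote. The pattern is a guess built from the sample tags in this thread, not the original one, and the handler name demoOgPattern is made up for the example.

```livecode
on demoOgPattern
   local tTag, tRx, tKey, tValue
   -- one sample tag, assembled with the quote constant
   put "<meta property=" & quote & "og:title" & quote && "content=" & quote & "Kauai's Hindu Monastery" & quote & ">" into tTag
   -- ASSUMED pattern: group 1 is the key after "og:", group 2 is the content value
   put "(?ms)<meta property=\x{22}og:(.+?)\x{22} content=\x{22}(.+?)\x{22}>" into tRx
   if matchText(tTag, tRx, tKey, tValue) then
      put tKey & ":" & tValue
   else
      put "no match"
   end if
end demoOgPattern
```

A pattern this literal only fits tags written exactly like the sample (one space between the attributes, property first), so real pages may well need something looser.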
Re: Parsing (scraping) OpenGraph Tags from html HEAD
"delimiters can now be more than a single character."

Hmm, that completely did not cross my mind… awesome.

On 7/29/17, 5:36 PM, "use-livecode on behalf of J. Landman Gay via use-livecode" wrote:

> [snip]
Re: Parsing (scraping) OpenGraph Tags from html HEAD
Here's where it's handy that delimiters can now be more than a single character. This should extract the lines you need regardless of whether they contain carriage returns or not:

   on parseHeader pData
      set the lineDel to "",l)-1 of l & cr after tList
      end repeat
      -- do something with tList
   end parseHeader

On 7/29/17 3:16 PM, Sannyasin Brahmanathaswami via use-livecode wrote:

> [snip]

--
Jacqueline Landman Gay | jac...@hyperactivesw.com
HyperActive Software | http://www.hyperactivesw.com
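Since the handler above reached the archive in damaged form, here is a hedged sketch of the multi-character delimiter idea it describes. It is a reconstruction from the output reported elsewhere in the thread, not Jacque's code: the function name ogLinesSketch, the delimiter string and the "og:" test are all assumptions.

```livecode
function ogLinesSketch pData
   local tList, tLine
   -- the multi-character delimiter makes every meta tag start a new "line"
   set the lineDelimiter to "<meta property="
   repeat for each line tLine in pData
      -- ASSUMPTION: keep only the chunks whose property value starts with "og:
      if tLine begins with (quote & "og:") then
         -- keep everything up to, but not including, the closing bracket of the tag
         put char 1 to (offset(">", tLine) - 1) of tLine & return after tList
      end if
   end repeat
   delete last char of tList -- drop the trailing return
   return tList
end ogLinesSketch
```

Each entry of the result has the form `"og:title" content="Kauai's Hindu Monastery"`, so it is a list that still needs splitting into key and value, as noted in the replies.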
Re: Parsing (scraping) OpenGraph Tags from html HEAD
Hi Swami,

I know you can do this in JavaScript, but you will have to enumerate through a JavaScript object to get all of the properties:

https://www.w3schools.com/jsref/prop_meta_content.asp

> On Jul 29, 2017, at 4:16 PM, Sannyasin Brahmanathaswami via use-livecode wrote:
>
> [snip]
Parsing (scraping) OpenGraph Tags from html HEAD
given that

a) trying to instantiate an XML tree from any given web page is likely to fail 85% of the time because they simply are never built to that strict a standard

and

b) you want to extract from the <head> of the document the openGraph tags

<meta property="og:site_name" content="YouTube">
<meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam">
<meta property="og:title" content="Kauai's Hindu Monastery">
<meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg">
<meta property="og:description" content="{where hinduism meets the future}">

c) you also cannot depend on the output being line delimited, because some CMS's delivery "agents" will minimize this to

<meta property="og:site_name" content="YouTube"><meta property="og:url" content="https://www.youtube.com/user/kauaiaadheenam"><meta property="og:title" content="Kauai's Hindu Monastery"><meta property="og:image" content="https://yt3.ggpht.com/-p766LczvKHY/AAI/AAA/SIu6ZAJbMDc/s900-c-k-no-mo-rj-c0xff/photo.jpg"><meta property="og:description" content="{where hinduism meets the future}">

Has anyone rolled up a parser/scraper for this? Looks like "idiot simple text extraction" but I'm trying to wrap my head around how to extract the name=value pairs, and not getting anything easy… these are space delimited, but then we also have spaces inside quoted strings. Maybe it is easier to target "<meta>" using regEx with matchText, get ALL the meta tags in the HEAD, push them to an array, then just check whether the key contains "og:"; if so, we have an openGraph value.

I'll sleep on this, but before I wake up and write 50 lines to get this done… I see the other thread on scraping pages generated by JS and suspect perhaps some wizard among us already has this done… would save a bit of time here.

BR
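The plan in the last paragraph (collect every meta tag, then keep the keys that carry the "og:" prefix) might be sketched as follows. The function name metaTagsWithPrefix, its parameters and the regex are assumptions for illustration; the pattern accepts either property= or name=, followed by content=, all in double quotes.

```livecode
function metaTagsWithPrefix pHtml, pPrefix
   local tRx, tKey, tAll, tFound
   local tKeyStart, tKeyEnd, tValStart, tValEnd
   -- ASSUMPTION: <meta property|name="KEY" content="VALUE" ...>, in that order
   put "(?is)<meta\s+(?:property|name)=\x{22}([^\x{22}]+)\x{22}\s+content=\x{22}([^\x{22}]+)\x{22}" into tRx
   -- first pass: every meta tag that fits the assumed layout
   repeat while matchChunk(pHtml, tRx, tKeyStart, tKeyEnd, tValStart, tValEnd)
      put char tValStart to tValEnd of pHtml into tAll[char tKeyStart to tKeyEnd of pHtml]
      delete char 1 to tValEnd of pHtml
   end repeat
   -- second pass: keep only the keys that begin with the wanted prefix
   repeat for each key tKey in tAll
      if tKey begins with pPrefix then put tAll[tKey] into tFound[tKey]
   end repeat
   return tFound
end metaTagsWithPrefix
```

A call such as `put metaTagsWithPrefix(_Html, "og:") into tOG` would return a flat array. A key that appears more than once keeps only its last value here, so repeated tags would need a nested layout instead.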