RE: Searching through pages instead of objects

Lanny R. Udey Wed, 21 Mar 2001 03:54:21 -0800
I didn't see your original post, but we use a spider-based indexer to index the pages 
instead of Verity or SQL.  There are several you can buy, or you can use an ASP like 
Atomz, which is what we ended up doing because of all the features they had.

Lanny Udey
Associate Dean,
Learning and Information Technology
Hofstra University
[EMAIL PROTECTED]

>>> [EMAIL PROTECTED] Wednesday, March 21, 2001 >>>
Could try this to create an index of all objects currently published onto
pages.
It's what I've used in 1.01 and directly accesses the sitecomposition table
so probably doesn't work in 1.5.
It also wouldn't work properly if you use any personalisation rules in your
containers since the user that runs the indexer wouldn't have any options
set to show content.
It also needs cfx_pcregex, which you can download from the tag gallery, and I
suspect the SQL I've used may only work with Oracle.
Anyway, it may give you a start for your own indexing.

Regards,
Andy

<!---
cf_indexsite.cfm

Takes a compositionid (site,section or page) and indexes the pages below it

Attributes:
datasource              (REQUIRED)      datasource that contains the sitemodel
compositionid   (REQUIRED)      the objectid of the site, section or page to index
collection              (REQUIRED)      the collection name to index to
purge                   (OPTIONAL)      boolean to indicate whether to purge the index first
--->

<cfparam name="attributes.datasource" type="string">
<cfparam name="attributes.compositionid" type="UUID">
<cfparam name="attributes.collection" type="string"
default="ProvidentSiteIndex">
<cfparam name="attributes.purge" type="boolean" default="FALSE">

<!--- get all the pages --->
<cfquery name="GetPages" datasource="#attributes.datasource#">
SELECT compositionid,compositionlabel,absoluteurl,relativeurl,absolutepath
FROM sitecomposition
WHERE LOWER(compositiontype)='page'
START WITH compositionid = '#attributes.compositionid#'
CONNECT BY parentcompositionid = PRIOR compositionid
</cfquery>

<!--- set up a query to store the results to be indexed --->
<cfset qPages = QueryNew("objectid,title,url,content")>

<!--- loop through the pages --->
<cfloop query="GetPages">
        <!--- make sure we've got a url to the page --->
        <cfif Len(Trim(absoluteurl))>
                <cfset url = absoluteurl>
        <cfelse>
                <cfif Len(Trim(relativeurl))>
                        <cfset url = "http://" & cgi.server_name & relativeurl>
                <cfelse>
                        <cfset url = REReplace(absolutepath,"[[:alpha:]]:\\","http://" & cgi.server_name & "/")>
                </cfif>
        </cfif>
        <cfset url = Replace(url,"\","/","ALL")>
        <cfset url = Replace(url," ","%20","ALL")>

        <!--- get the page --->
        <cftry>
                <cfhttp url="#url#" method="GET" resolveurl="false" useragent="PF Indexer"
                        timeout="10" throwonerror="yes">

                <!--- rip out the whole head and any other tags --->
                <cfx_pcregex subject="#cfhttp.filecontent#" pattern="(?isU)<HEAD>.*</HEAD>"
                        results="content" replace="" count="ALL">
                <cfx_pcregex subject="#content#" pattern="(?isU)<STYLE[^>]*>.*</STYLE>"
                        results="content" replace="" count="ALL">
                <cfx_pcregex subject="#content#" pattern="(?isU)<SCRIPT[^>]*>.*</SCRIPT>"
                        results="content" replace="" count="ALL">
                <cfx_pcregex subject="#content#" pattern="(?isU)<[^>]*>"
                        results="content" replace=" " count="ALL">

                <cfscript>
                QueryAddRow(qPages);
                QuerySetCell(qPages,"objectid",compositionid);
                QuerySetCell(qPages,"title",compositionlabel);
                QuerySetCell(qPages,"url",url);
                QuerySetCell(qPages,"content",content);
                </cfscript>

                Indexed <cfoutput>#url#</cfoutput><br>

                <cfcatch>
                Couldn't get <cfoutput>#url#</cfoutput><br>
                <cfa_dump var="#cfcatch#">
                </cfcatch>
        </cftry>

</cfloop>

<cfif attributes.purge>
        <cfindex action="PURGE" collection="#attributes.collection#">
</cfif>

<!--- index the results --->
<cfindex action="UPDATE" collection="#attributes.collection#" key="objectid"
type="CUSTOM" title="title" query="qPages" body="content" custom1="url"
custom2="">
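
For anyone wiring this up, here is a rough sketch of calling the tag and then
searching the collection it fills. It is only an illustration, not part of the
tag above. It assumes the tag is saved as cf_indexsite.cfm in the same directory
as the calling page, that the Verity collection already exists, and that the
root objectid and the search terms arrive as URL parameters. The names
"mySpectraDsn", "rootid", "criteria" and "qHits" are made up for the example.

```cfml
<!--- Assumed inputs: ?rootid=<objectid of site, section or page>&criteria=<search terms> --->
<cfparam name="url.rootid" type="UUID">
<cfparam name="url.criteria" type="string" default="">

<!--- Rebuild the index; "mySpectraDsn" is a placeholder datasource name --->
<cfmodule template="cf_indexsite.cfm"
          datasource="mySpectraDsn"
          compositionid="#url.rootid#"
          collection="ProvidentSiteIndex"
          purge="true">

<!--- Search the collection; the tag stored each page's URL in custom1 --->
<cfif Len(Trim(url.criteria))>
        <cfsearch name="qHits" collection="ProvidentSiteIndex"
                  type="SIMPLE" criteria="#url.criteria#">
        <cfoutput query="qHits">
                <a href="#custom1#">#HTMLEditFormat(title)#</a> (#score#)<br>
        </cfoutput>
</cfif>
```

In practice you would probably run the indexing step from a scheduled page
rather than on every search request, since it fetches every page over HTTP.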