collection setup with lucee using cfml – DAStek Softwares Pvt Ltd

Collection Setup with Lucee using CFML Language

Requirements:

Set up the collection
Implement indexing using an S3 bucket
Configure CFSEARCH
Apply a custom approach to handle the CFSEARCH empty-context bug in Lucee
Cache S3 key lookups to avoid repeated bucket scans and speed up the custom context used to work around the CFSEARCH context issue in Lucee

A collection is a structured storage area used by CFSEARCH to index and search documents efficiently.
It contains indexed data, metadata, and search-related information organized by Lucee’s search engine.

Set up the collection

Create a folder in the project directory where the collection will be stored.
Example: Create a folder named collections inside the project directory. This folder will contain all collection directories.
Main directory path:
/opt/lucee/tomcat/webapps/ROOT/collections/
The following code creates a collection named "testcollection" and stores its data in the testcollection directory.

<cfset indexPath = expandPath("/collections")>
<!--- Ensure the parent folder exists --->
<cfif NOT directoryExists(indexPath)>
    <cfdirectory action="create" directory="#indexPath#">
</cfif>
<!--- Create the collection inside the parent folder --->
<cfcollection action="create" collection="testcollection" path="#indexPath#/testcollection" language="English">
<cfoutput>Collection testcollection created at #indexPath#/testcollection.</cfoutput>

Code to list collections
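
A minimal sketch of how the collections could be listed, assuming the standard list action of the cfcollection tag. The query name qCollections is a placeholder chosen for illustration.

```cfml
<!--- List all registered collections into a query (qCollections is a hypothetical name) --->
<cfcollection action="list" name="qCollections">
<!--- Dump the query to inspect the available columns and values --->
<cfdump var="#qCollections#">
```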

Implement indexing using an S3 bucket

This process reads files from an S3 bucket, extracts their content, and indexes them into a Lucee collection for efficient searching.
Code Example: Index Documents from S3 into a Collection
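
The listing that follows begins partway through the script, inside the loop over S3 objects. As a rough sketch of the kind of loop that could precede it, the block below lists the objects in a bucket page by page and downloads each one to a temporary file. It assumes the AWS SDK for Java v1 classes used elsewhere in this article are on Lucee's classpath, that s3AccessKey, s3SecretKey, s3region and bucketName are defined elsewhere, and that folderPrefix is a hypothetical prefix limiting the scan. It is not the author's original code.

```cfml
<cfscript>
// Assumed to be defined elsewhere: s3AccessKey, s3SecretKey, s3region, bucketName.
// folderPrefix is a hypothetical prefix used to limit the listing.
credentials = createObject("java", "com.amazonaws.auth.BasicAWSCredentials").init(s3AccessKey, s3SecretKey);
provider = createObject("java", "com.amazonaws.auth.AWSStaticCredentialsProvider").init(credentials);
s3Service = createObject("java", "com.amazonaws.services.s3.AmazonS3ClientBuilder")
    .standard()
    .withCredentials(provider)
    .withRegion(s3region)
    .build();

listRequest = createObject("java", "com.amazonaws.services.s3.model.ListObjectsV2Request").init()
    .withBucketName(bucketName)
    .withPrefix(folderPrefix);

do {
    // Each call returns one page of object summaries
    listResult = s3Service.listObjectsV2(listRequest);
    summaries = listResult.getObjectSummaries();
    for (i = 0; i < summaries.size(); i++) {
        obj = summaries.get(javaCast("int", i));
        objectKey = obj.getKey();
        // Skip "folder" placeholder keys
        if (right(objectKey, 1) == "/") continue;
        // Download the object to a temporary file for text extraction
        tempFilePath = getTempFile(getTempDirectory(), "s3idx");
        s3Service.getObject(
            createObject("java", "com.amazonaws.services.s3.model.GetObjectRequest").init(bucketName, objectKey),
            createObject("java", "java.io.File").init(tempFilePath)
        );
        // Extract the text from tempFilePath and index it here, then clean up
        fileDelete(tempFilePath);
    }
    // Request the next page only when the listing was truncated
    if (listResult.isTruncated()) {
        listRequest.setContinuationToken(listResult.getNextContinuationToken());
    }
} while (listResult.isTruncated());
</cfscript>
```

Paging with the continuation token matters because a single listing call returns only a limited number of keys, so large buckets need several calls.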

<!--- Extract text from PDF using PDFBox (uncomment if needed)
<cfset PDDocument = createObject("java", "org.apache.pdfbox.pdmodel.PDDocument")>
<cfset PDFTextStripper = createObject("java", "org.apache.pdfbox.text.PDFTextStripper")>
<cfset pdfFile = createObject("java", "java.io.File").init(tempFilePath)>
<cfset doc = PDDocument.load(pdfFile)>
<cfset stripper = PDFTextStripper.init()>
<cfset cleanContent = stripper.getText(doc)>
<cfset doc.close()>
--->
<!--- Index into collection --->
<cfindex
    collection="#collectionName#"
    action="update"
    type="custom"
    key="s3://#bucketName#/#objectKey#"
    body="#cleanContent#"
    title="#listLast(objectKey, '/')#"
    urlpath="s3://#bucketName#/#objectKey#"
    status="info"
    custom1="#bucketName#"
    custom2="#dateFormat(obj.getLastModified(), 'yyyy-mm-dd')#"
/>
<!--- Cleanup temp file --->
<cffile action="delete" file="#tempFilePath#">
<cfset totalIndexed++>
<cfcatch type="any">
<!--- type="html" so the dump renders correctly in the mail body --->
<cfmail to="user@gmail.com" from="admin@gmail.com" subject="Test Collection Error – Inner Loop" type="html">
<cfdump var="#cfcatch.message#">
</cfmail>
</cfcatch>
</cftry>
</cfloop>
</cfloop>
<!--- Notify once the whole bucket has been processed --->
<cfmail to="user@gmail.com" from="admin@gmail.com" subject="Test Collection Index Update Result">
Collection Done
</cfmail>

Add this code to a file and run it via a scheduled task, because processing may take a long time if the S3 bucket contains many documents.
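
As a rough illustration, a scheduled task could be registered with the cfschedule tag. The task name, URL, start time and timeout below are placeholders, not values from the original setup; the same task can also be created in the Lucee administrator.

```cfml
<!--- Register (or update) a daily scheduled task that calls the indexing script.
      The task name and URL are placeholders for illustration. --->
<cfschedule
    action="update"
    task="IndexS3Collection"
    operation="HTTPRequest"
    url="http://localhost:8888/tasks/indexS3Collection.cfm"
    startDate="#dateFormat(now(), 'mm/dd/yyyy')#"
    startTime="02:00 AM"
    interval="daily"
    requestTimeOut="3600">
```

A generous requestTimeOut is worth setting here, since indexing a large bucket can run well beyond the default request timeout.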

Configure CFSEARCH

CFSEARCH is a CFML tag used to search a full-text collection that was created with CFCOLLECTION and populated with CFINDEX. It queries the indexed documents and returns results with scores, context snippets, and suggestions.

Example Search Form and CFSEARCH Usage:

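<!--- Minimal search form; posts back to the same page --->
<form method="post">
    <input type="text" name="searchText">
    <input type="submit" value="Search">
</form>
<cfif structKeyExists(form, "searchText") AND len(trim(form.searchText))>
    <!--- Run the search against the collection --->
    <cfsearch collection="testcollection" name="qSearch" criteria="#trim(form.searchText)#" maxrows="100">
    <!--- Debug output of the raw result query; remove in production --->
    <cfdump var="#qSearch#">
    <cfif qSearch.recordCount EQ 0>
        <p>No results found for "<b><cfoutput>#encodeForHTML(form.searchText)#</cfoutput></b>".</p>
    <cfelse>
        <h3>Results for "<b><cfoutput>#encodeForHTML(form.searchText)#</cfoutput></b>":</h3>
        <cfoutput query="qSearch">
            <p>
            <b>#title#</b><br>
            Score: #score#<br>
            Key: #key#<br>
            Snippet: #context#<br>
            </p>
        </cfoutput>
    </cfif>
</cfif>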

This produces a simple search page: the user enters a term, Lucee searches the collection, and the results (or a "no results" message) are displayed.

Apply a custom approach to handle the CFSEARCH context-empty bug in Lucee

Lucee's CFSEARCH has a bug where the context value is always empty. To work around it, we implemented custom functions that generate the context. The context value shows a snippet of text (around 300–400 characters) surrounding the search term, similar to how CFSEARCH worked before the bug.
• The function getS3SnippetByKeyFromHTMLWithCache (or getS3SnippetByKeyFromPDFWithCache for PDFs) returns a snippet of text around the search term.
• The file name and path are obtained from the url column returned by CFSEARCH (the value stored through the urlpath attribute of CFINDEX) and passed into these functions.
• The function getCustomContext formats the snippet to mimic the style of the context value that CFSEARCH would normally provide.
• The code checks whether the context field is empty and, if so, overwrites it with the custom snippet, so that each search result shows meaningful text surrounding the search term.

Apply caching for S3 key lookup to reduce repeated file scans and improve the performance of the custom context used to resolve the CFSEARCH context issue in Lucee

The findS3KeyByTitleWithCache(bucketName, folderPrefix, title) function looks up a file key in S3 and caches the result, so the bucket is not scanned again for every search result.
Step 1: Check Application.s3KeyCache for a cached key that hasn't expired (6-hour TTL).
Step 2: If not cached, scan the S3 bucket for the object using the AWS SDK, handling pagination via continuation tokens.
Step 3: Store the found key in the cache with a timestamp for future lookups.
The function returns the matched S3 key, or an empty string if no match is found.
Benefit: Fewer S3 scans, which speeds up the generation of the custom context for search results.
Below is a compilation of all CFSEARCH functionality, including the fix for the empty context bug in Lucee using the custom context implementation.

Looks for a file key in S3 by title with caching

<!--- looks for a file key in S3 by title with caching --->
<cffunction name="findS3KeyByTitleWithCache" access="public" returntype="string" output="false">
<cfargument name="bucketName" type="string" required="true">
<cfargument name="folderPrefix" type="string" required="true">
<cfargument name="title" type="string" required="true">
<cfscript>
// Cache key (unique per folder + title)
var cacheKey = arguments.folderPrefix & arguments.title;
// Cache expiration time = 6 hours (360 minutes)
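
The listing above stops after the cache key is built. Following the three steps described earlier, a complete version might look roughly like the sketch below. Several details are assumptions rather than the author's code: matching the title against the file name (the last path segment of the key, compared case-sensitively), the layout of the cache entries (key plus cachedAt), and the availability of s3AccessKey, s3SecretKey and s3region from the surrounding code, as in the other functions of this article.

```cfml
<cffunction name="findS3KeyByTitleWithCache" access="public" returntype="string" output="false">
    <cfargument name="bucketName" type="string" required="true">
    <cfargument name="folderPrefix" type="string" required="true">
    <cfargument name="title" type="string" required="true">
    <cfscript>
        var cacheKey = arguments.folderPrefix & arguments.title;
        var ttlMinutes = 360; // 6 hours
        var foundKey = "";
        var i = 0;

        // Step 1: return the cached key if it exists and has not expired
        if (!structKeyExists(application, "s3KeyCache")) {
            application.s3KeyCache = {};
        }
        if (structKeyExists(application.s3KeyCache, cacheKey)) {
            var entry = application.s3KeyCache[cacheKey];
            if (dateDiff("n", entry.cachedAt, now()) LT ttlMinutes) {
                return entry.key;
            }
        }

        // Step 2: scan the bucket page by page (credentials assumed to be defined elsewhere)
        var credentials = createObject("java", "com.amazonaws.auth.BasicAWSCredentials").init(s3AccessKey, s3SecretKey);
        var provider = createObject("java", "com.amazonaws.auth.AWSStaticCredentialsProvider").init(credentials);
        var s3Service = createObject("java", "com.amazonaws.services.s3.AmazonS3ClientBuilder")
            .standard()
            .withCredentials(provider)
            .withRegion(s3region)
            .build();
        var listRequest = createObject("java", "com.amazonaws.services.s3.model.ListObjectsV2Request").init()
            .withBucketName(arguments.bucketName)
            .withPrefix(arguments.folderPrefix);
        var listResult = "";
        var summaries = "";
        do {
            listResult = s3Service.listObjectsV2(listRequest);
            summaries = listResult.getObjectSummaries();
            for (i = 0; i < summaries.size(); i++) {
                var objectKey = summaries.get(javaCast("int", i)).getKey();
                // Assumption: the title equals the file name (last path segment), case-sensitive
                if (compare(listLast(objectKey, "/"), arguments.title) == 0) {
                    foundKey = objectKey;
                    break;
                }
            }
            // Move to the next page only when the listing was truncated
            if (listResult.isTruncated()) {
                listRequest.setContinuationToken(listResult.getNextContinuationToken());
            }
        } while (!len(foundKey) && listResult.isTruncated());

        // Step 3: cache the key with a timestamp for later lookups
        if (len(foundKey)) {
            application.s3KeyCache[cacheKey] = { key: foundKey, cachedAt: now() };
        }
        return foundKey;
    </cfscript>
</cffunction>
```

This sketch does not cache misses, so a title that is not in the bucket triggers a full scan on every call. Caching an empty result with a shorter TTL would be one way to limit that.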

Gets a snippet from a PDF file in S3 by title with caching

<!--- gets a snippet from a PDF file in S3 by title with caching --->
<cffunction name="getS3SnippetByKeyFromPDFWithCache" access="public" returntype="string" output="false">
<cfargument name="bucketName" type="string" required="true">
<cfargument name="folderPrefix" type="string" required="true">
<cfargument name="title" type="string" required="true">
<cfargument name="searchKey" type="string" required="true">
<cfargument name="snippetLength" type="numeric" required="false" default="600">

<cfscript>
var snippet = "";
var key = findS3KeyByTitleWithCache(arguments.bucketName, arguments.folderPrefix, arguments.title);
if (!len(key)) return ""; // File not found
// Create S3 client (s3AccessKey, s3SecretKey and s3region are expected to be defined elsewhere)
var credentials = createObject("java", "com.amazonaws.auth.BasicAWSCredentials")
    .init(s3AccessKey, s3SecretKey);
var provider = createObject("java", "com.amazonaws.auth.AWSStaticCredentialsProvider").init(credentials);
var s3Service = createObject("java", "com.amazonaws.services.s3.AmazonS3ClientBuilder")
    .standard()
    .withCredentials(provider)
    .withRegion(s3region)
    .build();
var s3Object = s3Service.getObject(arguments.bucketName, key);
var inputStream = s3Object.getObjectContent();
// Load PDF directly from input stream
var pdfDoc = createObject("java", "org.apache.pdfbox.pdmodel.PDDocument").load(inputStream);
var stripper = createObject("java", "org.apache.pdfbox.text.PDFTextStripper").init();
var content = stripper.getText(pdfDoc);
pdfDoc.close();
inputStream.close();
// Normalize spaces
content = reReplace(content, "\s+", " ", "all");

// Find keyword (case-insensitive)
var pos = findNoCase(arguments.searchKey, content);
if (pos EQ 0) return "";
// Extract snippet around keyword; int() keeps the positions whole for odd lengths
var halfLen = int(arguments.snippetLength / 2);
var startPos = max(pos - halfLen, 1);
var endPos = min(pos + len(arguments.searchKey) + halfLen - 1, len(content));
snippet = mid(content, startPos, endPos - startPos + 1);
</cfscript>
<cfreturn snippet>
</cffunction>

Gets a snippet from an HTML file in S3 by title with caching

<!--- gets a snippet from an HTML file in S3 by title with caching --->
<cffunction name="getS3SnippetByKeyFromHTMLWithCache" access="public" returntype="string" output="false">
<cfargument name="bucketName" type="string" required="true">
<cfargument name="folderPrefix" type="string" required="true">
<cfargument name="title" type="string" required="true">
<cfargument name="searchKey" type="string" required="true">
<cfargument name="snippetLength" type="numeric" required="false" default="5000">
<cfscript>
// Step 1: Find the S3 key
var key = findS3KeyByTitleWithCache(arguments.bucketName, arguments.folderPrefix, arguments.title);
if (!len(key)) return ""; // File not found
// Step 2: Create S3 client and read content
// (s3AccessKey, s3SecretKey and s3region are expected to be defined elsewhere)
var credentials = createObject("java", "com.amazonaws.auth.BasicAWSCredentials")
    .init(s3AccessKey, s3SecretKey);
var provider = createObject("java", "com.amazonaws.auth.AWSStaticCredentialsProvider").init(credentials);
var s3Service = createObject("java", "com.amazonaws.services.s3.AmazonS3ClientBuilder")
    .standard()
    .withCredentials(provider)
    .withRegion(s3region)
    .build();
var s3Object = s3Service.getObject(arguments.bucketName, key);
var inputStream = s3Object.getObjectContent();
// Decode the raw bytes as text; UTF-8 is assumed here (readAllBytes needs Java 9+)
var content = toString(inputStream.readAllBytes(), "UTF-8");
inputStream.close();
// Step 3: Strip HTML tags
content = reReplace(content, "<[^>]+>", "", "all");
// Step 4: Normalize all whitespace
content = reReplace(content, "\s+", " ", "all");
// Step 5: Find keyword case-insensitive
var pos = findNoCase(arguments.searchKey, content);
if (pos EQ 0) return ""; // Keyword not found
// Step 6: Extract snippet; int() keeps the positions whole for odd lengths
var halfLen = int(arguments.snippetLength / 2);
var startPos = max(pos - halfLen, 1);
var endPos = min(pos + len(arguments.searchKey) + halfLen - 1, len(content));
var snippet = mid(content, startPos, endPos - startPos + 1);
</cfscript>
<cfreturn snippet>
</cffunction>

Extracts custom context snippets with highlighted keywords

<!--- extracts custom context snippets with highlighted keywords --->
<cffunction name="getCustomContext" access="public" returntype="string">
<cfargument name="text" type="string" required="true">
<cfargument name="criteria" type="string" required="true">
<cfargument name="passageLength" type="numeric" default="200">
<cfargument name="maxPassages" type="numeric" default="3">
<cfset var cleanText = arguments.text>
<cfset var snippets = []>
<cfset var p = "">
<cfset var i = 0>
<cfset var firstPos = 0>
<cfset var startPos = 0>
<cfset var endPos = 0>
<cfset var snippet = "">
<!--- Half of the passage length, kept as a whole number --->
<cfset var halfLen = int(arguments.passageLength / 2)>
<!--- Step 1: Clean text --->
<cfset cleanText = reReplace(cleanText, "<[^>]+>", " ", "all")> <!--- Remove HTML tags --->
<cfset cleanText = reReplace(cleanText, "[\r\n\t]+", " ", "all")> <!--- Remove line breaks/tabs --->
<cfset cleanText = reReplace(cleanText, "v\:\\*.*?}", " ", "all")> <!--- Remove VML patterns --->
<cfset cleanText = reReplace(cleanText, "[^\w\s\.,\-]", " ", "all")> <!--- Keep letters, numbers, spaces, punctuation --->
<!--- Step 2: Split criteria by commas --->
<cfset var phrases = listToArray(arguments.criteria, ",")>
<!--- Step 3: Highlight keywords/phrases --->
<cfloop array="#phrases#" index="p">
<cfset p = trim(p)>
<cfif len(p)>
<!--- Match ignoring punctuation around words; \2 keeps the original casing of the match --->
<cfset cleanText = reReplaceNoCase(cleanText, "(\b|[^a-zA-Z0-9])(#reEscape(p)#)(\b|[^a-zA-Z0-9])", "\1<b>\2</b>\3", "all")>
</cfif>
</cfloop>
<!--- Step 4: Extract snippets around highlights --->
<cfloop condition="i LT arguments.maxPassages">
<cfset firstPos = findNoCase("<b>", cleanText, endPos + 1)>
<cfif firstPos EQ 0>
<cfbreak>
</cfif>
<cfset startPos = max(1, firstPos - halfLen)>
<cfset endPos = min(len(cleanText), firstPos + halfLen)>
<cfset snippet = mid(cleanText, startPos, endPos - startPos)>
<cfset arrayAppend(snippets, snippet)>
<cfset i++>
</cfloop>
<!--- Step 5: Fallback if no matches found --->
<cfif arrayLen(snippets) EQ 0>
<cfset arrayAppend(snippets, left(cleanText, arguments.passageLength))>
</cfif>
<!--- Step 6: Return combined snippets --->
<cfreturn arrayToList(snippets, " ")>
</cffunction>

Simple search form
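
The form markup itself is not shown here. A minimal sketch, assuming the page posts back to itself and that the field is named searchText as the search code expects:

```cfml
<!--- Minimal search form; posts back to the current page (assumed layout) --->
<form method="post">
    <label for="searchText">Search:</label>
    <input type="text" id="searchText" name="searchText">
    <input type="submit" value="Search">
</form>
```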

Perform search if searchText is provided

<cfif structKeyExists(form, "searchText") AND len(trim(form.searchText))>
<!--- Perform search --->
<cfsearch collection="testcollection" name="sr" status="sj" criteria="#trim(form.searchText)#" suggestions="always" contextPassages="0" type="simple" maxrows="100">
<cfif sr.recordCount EQ 0>
<p>No results found for "<b><cfoutput>#encodeForHTML(form.searchText)#</cfoutput></b>".</p>
<cfelse>
<h3>Results for "<b><cfoutput>#encodeForHTML(form.searchText)#</cfoutput></b>":</h3>
<cfloop query="sr">
<!--- Extract S3 bucket and key from the url column --->
<cfset urlParts = replace(sr.url, "s3://", "", "one")>
<cfset bucketName = listFirst(urlParts, "/")>
<cfset fullKey = listRest(urlParts, "/")>
<!--- Extract folderPrefix (everything except last part / file name) --->
<cfset keyParts = listToArray(fullKey, "/")>
<cfset folderParts = arraySlice(keyParts, 1, arrayLen(keyParts)-1)>
<cfset folderPrefix = arrayToList(folderParts, "/")>
<!--- Ensure trailing slash --->
<cfif right(folderPrefix, 1) NEQ "/">
<cfset folderPrefix = folderPrefix & "/">
</cfif>
<!--- Get title (case-sensitive file name) --->
<cfset titleFromSearch = sr.title>
<cfset snippetLength = 200>
<!--- For HTML: use the bucket name parsed from the url column above --->
<cfset snippetData = getS3SnippetByKeyFromHTMLWithCache(bucketName, folderPrefix, titleFromSearch, trim(form.searchText), snippetLength)>
<!--- For PDF
<cfset snippetData = getS3SnippetByKeyFromPDFWithCache(bucketName, folderPrefix, titleFromSearch, trim(form.searchText), snippetLength)>
--->
<cfif NOT len(trim(sr.context))>
<!--- overwrite the empty context column for this row --->
<cfif len(snippetData)>
<cfset querySetCell(sr, "context", getCustomContext(snippetData, trim(form.searchText)), sr.currentRow)>
<cfelse>
<!--- fall back to the summary when no snippet could be read from S3 --->
<cfset querySetCell(sr, "context", getCustomContext(sr.summary, trim(form.searchText)), sr.currentRow)>
</cfif>
</cfif>
<p>
<cfoutput>
<b>#sr.title#</b><br>
Score: #sr.score#<br>
Key: #sr.key#<br>
Snippet: #sr.context#<br>
</cfoutput>
</p>
</cfloop>
</cfif>
</cfif>
