Multi-Valued (array) Crawled Property mapped to a single Managed Property?

Jul 21, 2008 at 4:18 AM
Edited Jul 21, 2008 at 4:24 AM

Has anyone successfully created a multi-value crawled property (an array) and mapped it to a single managed property? We are using this protocol handler code and trying to customize it to fit our needs. Here is the situation:

The issue is our data. For example's sake, let’s use a facet like Author. If a document has multiple authors, it seems logical to store multiple values in a single crawled property. Conceptually, we want to accomplish the following:
Managed Property -- Crawled Property -- Values
Author -->                 Author -->                Joe, Jane, Tom, Harry (Multiple authors for a single document)

Solution 1: (Ideal, but doesn’t seem possible)
It seems logical, based on our conceptual design, that we want a crawled property that holds an array. So we delved back into the world of unmanaged code and attempted to solve it. In the end, we “believe” that the protocol handler was successfully creating a crawled property and assigning it a VT_SAFEARRAY value. We tried to use the MOSS API to verify the VariantType of the crawled property. For the string and integer crawled properties, the VariantType is the corresponding VARENUM code for VT_LPWSTR and VT_INT respectively, but for the new crawled property created from our VT_SAFEARRAY the VariantType was 0 (rather than, say, 27 for VT_SAFEARRAY).
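
As a reading aid for those numbers, here is a minimal sketch that decodes a numeric VariantType into a VARENUM name. The constants are copied by value from the VARENUM enumeration in wtypes.h so the snippet compiles without the Windows SDK; the `kVt...` names and `DescribeVariantType` are hypothetical helpers, not part of the MOSS API. It shows that 0 decodes to VT_EMPTY, i.e. no type was recorded. Whether the indexer would accept a vector type such as VT_VECTOR | VT_LPWSTR in place of VT_SAFEARRAY is an untested assumption.

```cpp
#include <string>

// Values copied from VARENUM (wtypes.h); names are local to this sketch.
constexpr unsigned short kVtEmpty     = 0;
constexpr unsigned short kVtInt       = 22;
constexpr unsigned short kVtSafeArray = 27;
constexpr unsigned short kVtLpwstr    = 31;
constexpr unsigned short kVtVector    = 0x1000;  // modifier flag
constexpr unsigned short kVtArray     = 0x2000;  // modifier flag
constexpr unsigned short kVtTypeMask  = 0x0FFF;  // strips the modifier flags

// Hypothetical helper: turn a numeric VariantType into a readable name.
std::string DescribeVariantType(unsigned short vt)
{
    const unsigned short base = vt & kVtTypeMask;
    std::string name;
    switch (base) {
        case kVtEmpty:     name = "VT_EMPTY";     break;
        case kVtInt:       name = "VT_INT";       break;
        case kVtSafeArray: name = "VT_SAFEARRAY"; break;
        case kVtLpwstr:    name = "VT_LPWSTR";    break;
        default:           name = "VT_" + std::to_string(base); break;
    }
    // Prefix any modifier flags that were set on the base type.
    if (vt & kVtVector) name = "VT_VECTOR | " + name;
    if (vt & kVtArray)  name = "VT_ARRAY | " + name;
    return name;
}
```

Calling `DescribeVariantType(0)` yields "VT_EMPTY", and `DescribeVariantType(0x1000 | 31)` yields "VT_VECTOR | VT_LPWSTR".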

In short, we can verify that the crawled property was created but couldn’t verify its VariantType or what (if any) data it pointed to. Once we got this far and knew the crawled property was at least created, we wanted to map it to a managed property. Given the limitations of the SharePoint UI for managing properties, we figured we would have to tackle this programmatically and found an MSDN resource. Based on the text of that article, even if we could map a crawled property to an array, Managed Properties don’t accommodate any data types beyond:

  • Binary
  • Datetime
  • Decimal
  • Integer
  • Text
  • Yes/No

We tried to create a managed property and map it to our “array” crawled property but it threw an exception stating that the properties couldn’t be mapped, and we tried this with all of the available types (above) of managed properties. At this point, we stopped pursuing this solution and went to plan B.

Solution 2: (A sleazy hack)
Since there seems to be no way to use crawled properties that are arrays, we decided to create multiple crawled properties, all mapped to a single managed property. The number of crawled properties would be the maximum number of values that any single document has within that facet. Conceptually:
Managed Property -- Crawled Property -- Values
Author -->                 Author:0 -->             Joe
Author -->                 Author:1 -->             Jane
Author -->                 Author:2 -->             Tom
Author -->                 Author:3 -->             Harry

Notice that we would “index” our crawled properties up to the maximum number of values that any single document contains. So if the next document had 15 authors, we would have 15 crawled properties (Author:0 through Author:14), and if the document after that had only one author, it would use only Author:0.
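
The naming scheme can be sketched roughly as follows. This is only an illustration of the "Name:index" convention described above; `ExpandToIndexedProperties` is a hypothetical helper, and how each name/value pair is actually emitted to the indexer by the protocol handler is not shown.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: pair each value of a facet with an indexed crawled
// property name ("Author:0", "Author:1", ...). Indices are zero-based, so
// N values use the names <base>:0 through <base>:(N-1).
std::vector<std::pair<std::wstring, std::wstring>>
ExpandToIndexedProperties(const std::wstring& baseName,
                          const std::vector<std::wstring>& values)
{
    std::vector<std::pair<std::wstring, std::wstring>> result;
    result.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        result.emplace_back(baseName + L":" + std::to_wstring(i), values[i]);
    }
    return result;
}
```

For the four authors in the example, this produces the pairs (Author:0, Joe) through (Author:3, Harry). The number of distinct crawled property names grows with the largest value count seen for the facet, which is the growth we are worried about.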

As you probably noticed by the title of this solution, we are not very fond of it. It seems to be a hack / workaround rather than a solution. We wonder if there will be performance or other unknown issues with this. We plan on indexing a lot of metadata for many different types of content and are worried that the number of crawled properties will be astronomical.

Has anyone had luck using a protocol handler to create array-valued crawled properties that map to a single managed property? Is there something else we are missing? A better workaround than the one we already came up with? Any discussion would be helpful.

Thanks,

Ed

Jul 21, 2008 at 2:20 PM
There is some overlap between iFilters and protocol handlers, especially where metadata is concerned. I am a novice when it comes to protocol handlers (and little better on filters), but this might push you in the right direction...

The "FilterSample" project (Windows Search 3.x SDK) indexes tags in HTML documents.
The code at FilterSample.cpp:117 implies that multiple values can be indexed into a Text managed property (specifically Keywords) by separating the values with semicolons:

    // if it is the my keywords
    else if (wcscmp(pszName, L"mykeywords") == 0)
    {
        hr = _GetElementText(&pszValue, NULL);
        if (SUCCEEDED(hr))
        {
            CString strValue = pszValue;
            // indexer wants semicolons as separator between multi-valued strings
            strValue.Replace(L",", L";");

            // return this value chunk
            chunkValue.SetTextValue(PKEY_Keywords, strValue.GetString());
            return S_OK;
        }
    }
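
Applied to the Author example, the same idea might look like the sketch below: build one semicolon-separated string from the individual values and emit it as a single text property. `JoinForIndexer` is a hypothetical helper written with the standard library rather than CString, and it assumes the MOSS indexer treats semicolons the same way the Windows Search sample does, which I have not verified.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join the values of a multi-valued facet into one
// semicolon-separated string, skipping empty entries so that no empty
// value ends up between two separators.
std::wstring JoinForIndexer(const std::vector<std::wstring>& values)
{
    std::wstring joined;
    for (const std::wstring& value : values) {
        if (value.empty()) {
            continue;
        }
        if (!joined.empty()) {
            joined += L';';
        }
        joined += value;
    }
    return joined;
}
```

For Joe, Jane, Tom and Harry this gives "Joe;Jane;Tom;Harry". Note that a value which itself contains a semicolon would be split, so such values would need to be cleaned first.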
Regards,
    Ed