incremental crawl

Mar 6, 2008 at 11:09 AM
Hi,

I have the project running, and I have a question about incremental crawls.

When I do a full crawl, my ContentEnumerator returns 100 items, and those items get indexed.

When I do an incremental crawl, my ContentEnumerator returns 1 item (by design), but all 100 items are crawled again.

What's the catch here? Does SharePoint always crawl all items, or am I missing something?

The log for a not-modified item looks like this:

mossph://la/mainframe/...oduto=103&contrato=1854
Not Modified (Exception of type 'RETURN_HRESULT' was thrown.)
Mar 6, 2008 at 6:51 PM
Hi there,

In an incremental crawl, the protocol handler is supposed to check the items that were already indexed, too. If you look at the CreateAccessor method in the ProtocolHandler class, you will see that there is a test for the lastModifiedTime. That's the point where an existing item is checked for changes.
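In sketch form, that test amounts to something like the following (illustrative only, not the project's exact code: the managed signature is trimmed down, GetSourceLastModified is a hypothetical lookup against your content source, and ISearchProtocolReturnCodes is the enum from the sample's interop declarations). The gatherer hands you the timestamp it recorded on the previous crawl in INCREMENTAL_ACCESS_INFO.ftLastModifiedTime:

using System;
using System.Runtime.InteropServices;

// Minimal stand-in for the real interop struct; only the field used here.
[StructLayout(LayoutKind.Sequential)]
public struct INCREMENTAL_ACCESS_INFO
{
    public uint dwSize;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastModifiedTime;
}

public class ProtocolHandlerSketch
{
    const int S_OK = 0;

    // Declared [PreserveSig] in the interop interface so the positive
    // success code PRTH_S_NOT_MODIFIED can be returned directly.
    public int CreateAccessor(string url,
                              ref INCREMENTAL_ACCESS_INFO incrementalInfo /* , ... */)
    {
        // Timestamp the gatherer stored when it last crawled this URL.
        long ft = ((long)incrementalInfo.ftLastModifiedTime.dwHighDateTime << 32)
                  | (uint)incrementalInfo.ftLastModifiedTime.dwLowDateTime;
        DateTime lastCrawled = DateTime.FromFileTimeUtc(ft);

        // Hypothetical: ask the content source when the item last changed.
        DateTime lastModified = GetSourceLastModified(url);

        if (lastModified <= lastCrawled)
            return (int)ISearchProtocolReturnCodes.PRTH_S_NOT_MODIFIED; // skip it

        // ...otherwise create and return the UrlAccessor as usual...
        return S_OK;
    }

    DateTime GetSourceLastModified(string url)
    {
        throw new NotImplementedException(); // placeholder: query your backend
    }
}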

Cheers,
C


Mar 7, 2008 at 4:02 PM
Edited Mar 7, 2008 at 4:13 PM
Hi,
After some digging, I found this note in the SharePoint SDK 1.3:

"If the protocol handler's BindToFilter method does not implement emitting PID_GTHR_DIRLINK_WITH_TIME, and the CreateAccessor method does not support returning PRTH_S_NOT_MODIFIED, incremental crawls perform essentially the same as full crawls. "

My conclusion is that we must implement GetChunk/GetValue for PID_GTHR_DIRLINK_WITH_TIME instead of GetChunk/GetText for PID_GTHR_DIRLINK on the container enumerator. The problem for me is implementing GetValue, because it has to return a VT_VECTOR | VT_VARIANT value (a VT_LPWSTR and a VT_FILETIME), and I don't understand how to create and assign this value.

Can someone help?
TIA

Mar 13, 2008 at 2:52 AM
Hi,

Sorry, I don't understand your problem or what you are trying to achieve. Can you explain a bit more?

c.
Mar 18, 2008 at 11:12 AM
Hi,

I'm indexing a database table with millions of records, and the incremental crawl was basically a full crawl.

I got it working, and here's how I did it.

Prepare an auxiliary table:
1) Create a table with the PK and last-update timestamp of the table I want to index.
2) Add an additional column called "folder".
3) Populate the new table, filling the "folder" column randomly with 100 different values.

When I crawl the "root" folder, I enumerate the different "folder" values and get MAX(last_update):
--> SELECT FOLDER, MAX(LAST_UPDATE) FROM AUX_TABLE GROUP BY FOLDER   (AUX_TABLE = the table from step 1)
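In sketch form, the root-folder enumeration boils down to something like this (the table/column names, connection handling, and URL format are placeholders for my real ones):

using System;
using System.Collections.Generic;
using System.Data.SqlClient;

public static class FolderEnumerator
{
    // One row per "folder": the folder value plus the newest last_update
    // of any record in it. Each pair becomes one PID_GTHR_DIRLINK_WITH_TIME
    // emission (see the GetValue sketch below).
    public static IEnumerable<KeyValuePair<string, DateTime>> Enumerate(string connStr)
    {
        using (SqlConnection conn = new SqlConnection(connStr))
        using (SqlCommand cmd = new SqlCommand(
            "SELECT folder, MAX(last_update) FROM aux_table GROUP BY folder", conn))
        {
            conn.Open();
            using (SqlDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // Placeholder URL format for the child "folder".
                    string url = "mossph://la/mainframe/?folder=" + reader.GetString(0);
                    yield return new KeyValuePair<string, DateTime>(
                        url, reader.GetDateTime(1));
                }
            }
        }
    }
}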

On GetChunk, instead of emitting PID_GTHR_DIRLINK and using GetText to return the link, I emitted PID_GTHR_DIRLINK_WITH_TIME and used GetValue to return a variant vector with the link and the last-modified time of that link.
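In sketch form, the marshaling that GetValue has to do looks roughly like this (not exact production code: the offsets assume the standard PROPVARIANT layout, and how you return the pointer depends on how your IFilter interop declares GetValue). Per the IFilter contract the caller frees the returned value, which is why everything is allocated with CoTaskMemAlloc:

using System;
using System.Runtime.InteropServices;

static class DirLinkWithTime
{
    const ushort VT_LPWSTR   = 31;
    const ushort VT_FILETIME = 64;
    const ushort VT_VARIANT  = 12;
    const ushort VT_VECTOR   = 0x1000;

    // Builds the unmanaged PROPVARIANT for a PID_GTHR_DIRLINK_WITH_TIME
    // chunk: VT_VECTOR | VT_VARIANT with two elements, the child URL
    // (VT_LPWSTR) and its last-modified time (VT_FILETIME).
    public static IntPtr Build(string url, DateTime lastModifiedUtc)
    {
        // sizeof(PROPVARIANT): 16 bytes on x86, 24 on x64; the data
        // union starts at offset 8 in both cases.
        bool is64 = IntPtr.Size == 8;
        int pvSize = is64 ? 24 : 16;

        // The two inner PROPVARIANTs, in one zeroed contiguous block.
        IntPtr elems = AllocZeroed(pvSize * 2);

        // Element 0: the URL as VT_LPWSTR.
        Marshal.WriteInt16(elems, 0, unchecked((short)VT_LPWSTR));
        Marshal.WriteIntPtr(elems, 8, Marshal.StringToCoTaskMemUni(url));

        // Element 1: the last-modified time as VT_FILETIME.
        Marshal.WriteInt16(elems, pvSize, unchecked((short)VT_FILETIME));
        Marshal.WriteInt64(elems, pvSize + 8, lastModifiedUtc.ToFileTimeUtc());

        // The outer PROPVARIANT: a counted array (cElems, pElems).
        IntPtr outer = AllocZeroed(pvSize);
        Marshal.WriteInt16(outer, 0, unchecked((short)(VT_VECTOR | VT_VARIANT)));
        Marshal.WriteInt32(outer, 8, 2);                   // cElems
        Marshal.WriteIntPtr(outer, is64 ? 16 : 12, elems); // pElems

        return outer; // hand this pointer back from GetValue
    }

    static IntPtr AllocZeroed(int cb)
    {
        IntPtr p = Marshal.AllocCoTaskMem(cb);
        for (int i = 0; i < cb; i++) Marshal.WriteByte(p, i, 0);
        return p;
    }
}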

This way, on incremental crawls, the only "folders" (groups of records) that get indexed again are the ones containing at least one changed record. The more folders you have, the fewer records are re-indexed.

cheers

Jun 23, 2008 at 4:11 AM
Hi,

Are you able to copy/paste your current implementation of the GetValue method here?

tnx
Sep 23, 2008 at 8:42 PM
I second that! I've been trying to return a variant array with the URL and LastModified, but I'm having issues with the interop...

Could you please be so kind as to post your implementation? It would help tremendously!


Thanks
Nov 18, 2008 at 9:22 PM
I have a similar problem. Currently we are throwing the PRTH_S_NOT_MODIFIED result, which instructs the indexer that the item has not changed.

e.g.: hr = (HRESULT)ISearchProtocolReturnCodes.PRTH_S_NOT_MODIFIED;

Of course, in our scenario we will have ~10 million items. If 5 of them change, I only want to inspect those 5, NOT 10 million + 5.

If anyone has the solution to this problem please post it!
Nov 21, 2008 at 2:18 PM
I've been in contact with the MS search team. It seems this behaviour is by design and unavoidable.
Nov 21, 2008 at 3:32 PM
The behaviour you are searching for does exist.

The functionality I and others are looking for is this: when the list of URLs for all content is returned, you can also pass back a last-modified timestamp for each URL, which tells the indexing system not to 'call back into the protocol handler' to index that content. This eliminates spinning up a thread for each URL only to return a Not Modified return code.

I know this works, as I have several examples of this behavior in C++, but so far I have been unsuccessful at getting a C# interop to do the same.

Nov 24, 2008 at 7:39 PM
In my scenario we have a separate table where we track changes to our data. We inspect this table and only return those items as part of an incremental crawl. If we find that 10 items have changed, I only want the indexer to request updates for those 10 items. As it stands now, it will pass in those 10 plus all the other items it is aware of, and we have logic that instructs the indexer that all the other items have not changed.

I guess if I wanted to implement the last-modified date as you described, mob, I would have to return a complete list of ALL items and set the modified date for everything except the 10 to "now"? In our case we will have 8-10 million indexable items. I think our current implementation would have the least impact on the indexing system. Is this the case?
Jan 23, 2009 at 11:03 AM
I have a similar but even stranger problem. I checked the documentation and read many comments saying that if I emit PID_GTHR_DIRLINK_WITH_TIME, the crawler will remember the item's date-time and will not crawl the sub-item again if there is no change.

When I run the incremental crawl, the indexer enumerates the items from my container and I emit the sub-item URL + last-change time for each of them.
I expect the crawling process to stop there if no changes were detected.
But it continues crawling all of my items (in a different indexing process), as if all of them had changed, essentially acting as a full crawl.
I don't understand this kind of behavior. Can anyone help me?

Jan 23, 2009 at 4:03 PM
In the end, totally by chance, I found what was wrong.
It is the IsDirectory function. If it does not return false for leaf items, the crawler treats them as directories and tries to crawl them, no matter that you used PID_GTHR_DIRLINK_WITH_TIME.
And it could not be more odd, because I was actually doing everything "right".
I used the C# function Marshal.ThrowExceptionForHr() to return HRESULTs from my functions, including the IsDirectory function.
But ThrowExceptionForHr(int hresult) does not throw an exception if the HRESULT is >= 0! (S_FALSE is 1.) :(((
It could have saved me 3 days of debugging if I had known that ThrowExceptionForHr does not always throw.

So be aware: no method that returns a positive HRESULT can use Marshal.ThrowExceptionForHr() to return it.
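To make the pitfall concrete, here is a small sketch. The Main part is runnable as-is; the IsDirectory shape is illustrative only, modeled on the sample's IUrlAccessor-style interop (the GUID and the rest of the interface are omitted).

using System;
using System.Runtime.InteropServices;

class Pitfall
{
    const int S_FALSE = 1;

    static void Main()
    {
        // ThrowExceptionForHr only throws for FAILED (negative) HRESULTs.
        // S_FALSE is 1, a success code, so this line does nothing at all:
        Marshal.ThrowExceptionForHr(S_FALSE);
        Console.WriteLine("no exception was thrown");
    }
}

// One way around it: declare the method with [PreserveSig] in the interop
// interface, so it is exposed to COM as returning the HRESULT directly,
// and hand back success codes such as S_FALSE as a plain int.
interface IUrlAccessorFragment // shape only; the real declaration needs ComImport/Guid
{
    [PreserveSig]
    int IsDirectory(); // S_OK = container, S_FALSE = leaf item
}

class UrlAccessorSketch : IUrlAccessorFragment
{
    const int S_OK = 0, S_FALSE = 1;
    bool isLeafItem = true; // illustrative state

    public int IsDirectory()
    {
        // No exception machinery needed in either case.
        return isLeafItem ? S_FALSE : S_OK;
    }
}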