How does PRTH_GTHR_DIR_LINK_WITH_TIME help in incremental crawls?

Mar 9, 2009 at 4:30 PM
Edited Mar 9, 2009 at 4:33 PM
I am trying to understand how returning a dirlink with time helps an incremental crawl.

Once you have opened a container, you already have all the information, so you might as well return only the URLs that have been modified. Sending every URL with its timestamp and letting the filter daemon decide seems to add processing for the filter daemon.

Also, when you do an incremental crawl, the filter daemon just fires all the previously crawled URLs at the protocol handler in no particular order. For each one, the PH must check the modified time and either return PRTH_S_NOT_MODIFIED or, if the item was modified, return the chunks.
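The per-URL check described above can be modeled roughly as follows. This is only a sketch in Python, not the actual protocol handler interface: `handle_url`, `modified_times`, `read_chunks`, and the `store://` URLs are hypothetical names, and the string constant merely stands in for the real PRTH_S_NOT_MODIFIED result code.

```python
from datetime import datetime

# Hypothetical stand-in: the real PRTH_S_NOT_MODIFIED is a result code
# returned by the protocol handler, not a Python string.
NOT_MODIFIED = "PRTH_S_NOT_MODIFIED"


def handle_url(url, last_crawl_time, modified_times, read_chunks):
    """Return NOT_MODIFIED for an unchanged item, otherwise its chunks.

    modified_times maps each URL to its last-modified time in the store
    (an assumption: the handler can look this up cheaply).
    read_chunks is a callable that produces the content for a URL.
    """
    if modified_times[url] <= last_crawl_time:
        return NOT_MODIFIED
    return read_chunks(url)


# Two items, one changed after the previous crawl on 2009-03-05.
modified_times = {
    "store://a": datetime(2009, 3, 1),
    "store://b": datetime(2009, 3, 8),
}
last_crawl = datetime(2009, 3, 5)
results = {
    url: handle_url(url, last_crawl, modified_times,
                    lambda u: ["content of " + u])
    for url in modified_times
}
```

In this model the handler is still called once per known URL, which is the cost the question is about.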

It would be nice if it sent the URLs in a hierarchical manner, but it does not. If you have 10000 URLs and only 10 are modified, the PH will receive all 10000. Correct? If it instead sent only the top level, and the top level returned just those URLs that were modified, and so on down the hierarchy, that would be ideal.
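To make the difference concrete, here is a small Python model that counts handler calls under the two schemes. It is a sketch under a strong assumption that is labeled in the code: a container's timestamp is taken to be the newest timestamp anywhere beneath it, which many stores (plain file system folders, for example) do not guarantee. All names are hypothetical, and the tree shape (100 folders of 100 items) is chosen only for illustration.

```python
OLD, NEW, LAST_CRAWL = 1, 9, 5  # abstract timestamps, not real dates


def build_tree(folders=100, items_per_folder=100, modified_in_first=10):
    """Build root -> folders -> items, with a few items modified in folder 0."""
    children = []
    for f in range(folders):
        items = [
            {"mtime": NEW if (f == 0 and i < modified_in_first) else OLD,
             "children": []}
            for i in range(items_per_folder)
        ]
        # ASSUMPTION: a folder's time reflects its newest descendant.
        folder_mtime = max(item["mtime"] for item in items)
        children.append({"mtime": folder_mtime, "children": items})
    return {"mtime": max(c["mtime"] for c in children), "children": children}


def flat_visits(tree):
    """Count calls when every known URL is re-sent to the handler."""
    count, stack = 0, [tree]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(node["children"])
    return count


def pruned_visits(node, last_crawl):
    """Count calls when a child is visited only if its link time is newer."""
    visits = 1  # this node is opened
    for child in node["children"]:
        if child["mtime"] > last_crawl:
            visits += pruned_visits(child, last_crawl)
    return visits


tree = build_tree()
flat_count = flat_visits(tree)                  # 10101
pruned_count = pruned_visits(tree, LAST_CRAWL)  # 12
```

With 10 modified items in one folder, the flat scheme touches the root, all 100 folders, and all 10000 items, while the pruned walk opens only the root, the one changed folder, and its 10 changed items.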

How does sending dirlink with time help then?

Thanks,
Abhinav

Jul 18, 2009 at 9:02 PM

Agreed.

I guess the initial design for the indexer was to discover content on web sites or in hierarchies of folders.

I would prefer that the design only require you to provide a list of items that have been added, updated, or deleted. This is what change logs accomplish, but I have not seen any examples using this mechanism. SharePoint crawls use this mechanism, if I understand correctly.
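As a rough illustration of the change-log idea, and not of any real indexer or SharePoint API, the sketch below keeps a sequence token from the previous crawl and reduces the newer log entries to one pending operation per URL. `Change`, `pending_changes`, and the `store://` URLs are hypothetical, and the log is assumed to be ordered by ascending sequence number.

```python
from collections import namedtuple

# Hypothetical log entry: a sequence number, an operation, and the item URL.
Change = namedtuple("Change", "seq op url")


def pending_changes(log, last_seq):
    """Return ({url: latest op}, new token) for entries newer than last_seq.

    Assumes log is sorted by ascending seq, so a later entry for the
    same URL overwrites an earlier one.
    """
    latest = {}
    new_seq = last_seq
    for entry in log:
        if entry.seq > last_seq:
            latest[entry.url] = entry.op
            new_seq = entry.seq
    return latest, new_seq


log = [
    Change(1, "add", "store://a"),
    Change(2, "update", "store://b"),
    Change(3, "update", "store://a"),
    Change(4, "delete", "store://c"),
]
# The previous crawl already processed entry 1.
changes, token = pending_changes(log, last_seq=1)
```

Only the three URLs named in the newer entries would need to be touched, regardless of how many items the store holds.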

Has anyone implemented a third-party protocol handler using a change log? Any hints?