Office 2007 documents

Jun 22, 2009 at 4:16 PM

Hi,

Has anybody successfully crawled Office 2007 documents (docx, pptx, etc.) on 64-bit? Standard documents (doc, xls, etc.) and PDF (using the IFilter from Adobe) work well, but something fails when crawling Office 2007 file types.

I have installed the 64-bit version of the IFilter from Microsoft and followed the instructions (if I remember correctly, there was nothing to modify).

Office 2007 documents are successfully crawled by the standard crawler, so it’s only the custom crawler that doesn’t work. Once again, the third-party filter for PDF works well.

Our environment runs MOSS and WSS with SP2 and the latest infrastructure updates installed.

Thanks and regards

Mats Levinsson

Aug 19, 2009 at 3:43 PM
Edited Aug 19, 2009 at 3:53 PM

Hi,

I have the same issue with this configuration:

Windows 2003 (32-bit), MOSS 2007 SP1.

Does anybody know how to fix that?

Aug 19, 2009 at 4:07 PM

Hi Zan,

I went with the worst solution imaginable, but we had to be able to crawl docx and pptx documents ASAP: I wrote my own logic to extract the text from the documents. It was fairly easy, and so far it works well.
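For readers wondering what hand-written extraction of this kind might involve, here is a minimal Python sketch for pptx. It is not the logic described in this post. It assumes only that a pptx file is a ZIP archive with slide parts named ppt/slides/slideN.xml and text runs stored in DrawingML `a:t` elements; the function name `extract_pptx_text` is hypothetical.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace used for text runs (<a:t>) inside slide parts.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_RE = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(source):
    """Return one string per slide, ordered by slide file number.

    `source` is a path or a binary file-like object holding a .pptx file.
    File-number order may differ from the order shown in the presentation.
    """
    texts = []
    with zipfile.ZipFile(source) as archive:
        # Collect (number, part name) pairs so slide10 sorts after slide2.
        slides = []
        for name in archive.namelist():
            match = SLIDE_RE.match(name)
            if match:
                slides.append((int(match.group(1)), name))
        for _, name in sorted(slides):
            root = ET.fromstring(archive.read(name))
            runs = [t.text for t in root.iter(A_NS + "t") if t.text]
            texts.append(" ".join(runs))
    return texts
```

Speaker notes, charts, and embedded objects live in other parts of the archive and are ignored by this sketch.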

Sorry I could not give you a better answer.

I’m still very curious whether anybody has successfully crawled Office 2007 documents on an x64 platform, because this really limits the usefulness of a custom crawler. Even though it fulfills our needs right now, extending it to crawl other systems and web sites is not an option.

Best regards
Mats

Aug 20, 2009 at 8:36 AM

Hi matslev, thanks for the response. I agree with you. I have written to John but got no answer. I thought I could somehow extract document.xml from the docx, and that is now my preferred choice.
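The document.xml idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from anyone in this thread: it relies on a docx file being a ZIP archive whose main body is the part word/document.xml, with paragraphs in `w:p` elements and text in `w:t` elements. The function name `extract_docx_text` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace for paragraphs (<w:p>) and text runs (<w:t>).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(source):
    """Return the body text of a .docx file, one paragraph per line.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across runs; join them back together.
        text = "".join(t.text or "" for t in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```

Headers, footers, footnotes, and comments are stored in separate parts and are not covered here, and paragraphs nested inside text boxes may appear more than once.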

Aug 20, 2009 at 8:52 AM

Some references on docx and pptx:

http://conceptdev.blogspot.com/2007/03/open-docx-using-c-to-extract-text-for.html

http://www.codeproject.com/KB/office/ExtractTextFromDOCXs.aspx?msg=2637586

http://www.aspose.com/community/forums/thread/183245/updating-from-vers.3.0.1.15296-to-3.1.0.0-broke-my-code.aspx

Aug 20, 2009 at 9:44 AM

There are also extractor samples at http://www.codeproject.com/KB/IP/Searcharoo_5.aspx

Oct 20, 2009 at 7:18 AM

Hi there,

I've got the same problem under 32-bit. Just in case anyone else has the same problem, here is what I found:

- Install the latest version of the Office 2007 IFilters.

- Apparently, if you implement GetDocFormat, you have to make sure that the MIME type you pass there is no longer than 33 characters. That's the limit in SharePoint 2007, and the Office 2007 MIME types are quite long. Instead, try implementing GetCLSID, which should return the GUID for the Office 2007 document type. You can find the corresponding GUID in the registry.
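To see which file types run into the length limit, a quick check like the following may help. It is plain Python for illustration, not protocol handler code. The 33-character figure is the one reported in this post, the MIME types are the standard registered ones for each extension, and the helper names `fits_mime_limit` and `needs_clsid` are hypothetical.

```python
# Limit reported in this thread for the MIME type passed via GetDocFormat.
MIME_LIMIT = 33

# Standard registered MIME types for a few common extensions.
MIME_TYPES = {
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument"
             ".wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument"
             ".spreadsheetml.sheet",
    ".pptx": "application/vnd.openxmlformats-officedocument"
             ".presentationml.presentation",
}


def fits_mime_limit(mime_type, limit=MIME_LIMIT):
    """True when the MIME type is short enough to be passed by name."""
    return len(mime_type) <= limit


def needs_clsid(extension):
    """True when the extension's MIME type exceeds the reported limit."""
    return not fits_mime_limit(MIME_TYPES[extension.lower()])


if __name__ == "__main__":
    for ext, mime in MIME_TYPES.items():
        print(ext, len(mime), "CLSID" if needs_clsid(ext) else "MIME type")
```

With these values, the three Office 2007 types exceed the limit while doc, xls, and pdf fit, which matches the behaviour described in the thread.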

Hope this helps!

Ciprian

Oct 23, 2009 at 8:42 PM

Hi,

I'm having the same issue with the docx IFilter. My protocol handler has been working fine for a couple of years now with all sorts of different IFilters, but for some reason it just flat out won't work with this one. I tried the suggestion from ciprian and now implement GetCLSID and return the class ID from it. That got me further, in that it no longer terminates the filter process, but it's giving me the message you get when the file format is wrong (even though I'm sure it's not).

Anybody got any more information on this?

Thank You.

Oct 24, 2009 at 7:50 AM

Hi Danhun,

Sorry, I forgot to say that you'll also have to add the following registry keys:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\CLSID\{4887767F-7ADC-4983-B576-88FB643D6F79}]
@="{F90DFE0C-CBDF-41FF-8598-EDD8F222A2C8}"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\CLSID\{4F46F75F-199F-4C63-8B7D-86D48FE7970C}]
@="{DDFE337F-4987-4EC8-BDE3-133FA63D5D85}"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\CLSID\{D3B41FA1-01E3-49AF-AA25-1D0D824275AE}]
@="{5A98B233-3C59-4B31-944C-0E560D85E6C3}"
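If these entries have to be applied on several servers, one option is to generate a .reg file from the values instead of typing them by hand. The Python sketch below does only that: the key path and GUID pairs are copied verbatim from this post, while the function name `build_reg_file` and the validation step are additions for illustration. It makes no assumption about what each GUID identifies.

```python
import re

BASE_KEY = (
    r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search"
    r"\Setup\ContentIndexCommon\Filters\CLSID"
)

# Subkey GUID -> default value, copied verbatim from the post above.
MAPPINGS = {
    "{4887767F-7ADC-4983-B576-88FB643D6F79}":
        "{F90DFE0C-CBDF-41FF-8598-EDD8F222A2C8}",
    "{4F46F75F-199F-4C63-8B7D-86D48FE7970C}":
        "{DDFE337F-4987-4EC8-BDE3-133FA63D5D85}",
    "{D3B41FA1-01E3-49AF-AA25-1D0D824275AE}":
        "{5A98B233-3C59-4B31-944C-0E560D85E6C3}",
}

GUID_RE = re.compile(r"^\{[0-9A-F]{8}(-[0-9A-F]{4}){3}-[0-9A-F]{12}\}$")


def build_reg_file(mappings=MAPPINGS):
    """Return .reg file text that sets each subkey's default value."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for subkey, value in mappings.items():
        # Catch copy/paste damage before it reaches the registry.
        for guid in (subkey, value):
            if not GUID_RE.match(guid):
                raise ValueError(f"not a GUID: {guid!r}")
        lines.append(f"[{BASE_KEY}\\{subkey}]")
        lines.append(f'@="{value}"')
        lines.append("")
    return "\r\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file())
```

The output can be saved to a file and reviewed before importing it.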

Let us know if it works,

Cipri

Dec 2, 2009 at 10:30 PM

Hi again,

Thanks for your prior responses. Yes, they helped a lot, and it all works now.

I think a related issue has to do with crawling .MSG files. Do you happen to know if it's the same thing, where we need to return the CLSID instead of the MIME type? I believe the IFilter for that was in the same filter pack, so I'd assume so.

Thanks

Dec 7, 2009 at 3:17 AM

I'm not sure about that. I guess it is the CLSID. I do recall that we had this problem a while ago, but I don't remember exactly what I did. We were never interested in crawling msg files... I think we changed the CLSID for the msg files. Is your MIME type for Outlook longer than 33 characters?

Cipri

Jan 14, 2010 at 3:32 AM

Ciprian,

I have completed the steps you described above, and when I crawl a docx or xlsx I get no errors and can search for the file via metadata. Unfortunately, I am not getting a full-text index of the document. In my trace code I see that an IFilter is requesting the document chunks (and this works for other document types), but it just does not seem to be cracking the documents to get their contents.

Any suggestions would be appreciated.

Thanks

Jan 15, 2010 at 2:56 AM

My only suggestion at this stage would be to re-check your code and also the settings. I assume that you can successfully crawl doc, xls, and other types of files.

Cipri

Mar 10, 2010 at 10:16 PM

There was an issue with custom protocol handlers and (I believe) container document types such as the new Office file formats, MSG, etc. It was resolved in a post-SP2 rollup (June?).

Our protocol handler works by providing pseudo links to items within our filestore. These links do not have extensions on them; the file type is identified to SharePoint via the appropriate interface method. This was being ignored for container file types. Our workaround was to append the extension to the link for affected file types. It wasn't pretty, but it did work.
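The workaround amounts to a small rule applied when a link is generated. A rough Python illustration follows, with every name invented for the example: `build_link` and the `myproto://` link format are hypothetical, and the set of affected extensions is an assumption based on the file types named in this thread.

```python
# Assumed set of affected "container" types, based on the formats
# mentioned in this thread (new Office formats and MSG).
CONTAINER_EXTENSIONS = {".docx", ".xlsx", ".pptx", ".msg"}


def build_link(base_link, extension):
    """Append the extension to a pseudo link for affected file types only."""
    ext = extension.lower()
    if not ext.startswith("."):
        ext = "." + ext
    # Leave other types untouched, and avoid appending the extension twice.
    if ext in CONTAINER_EXTENSIONS and not base_link.lower().endswith(ext):
        return base_link + ext
    return base_link


if __name__ == "__main__":
    print(build_link("myproto://store/item/42", "docx"))
    print(build_link("myproto://store/item/43", "doc"))
```

Keeping the rule limited to the affected types means links for formats that already crawl correctly stay unchanged.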

Again, the issue should be resolved in the latest fix rollups, though you might need to talk to Microsoft support to gain access to them.