Monday, March 17, 2014

Enumerating and Scanning a Large Number of Files Recursively for a Pattern with PowerShell

Using PowerShell 3.0, I found a large difference in performance between the PowerShell cmdlets (depending on how you call them) and the .NET classes in the System.IO namespace.

Scenario:
Scanning a large number of files recursively (approximately 32,000 files across 238 folders) and matching each file name against a pattern. In this example, 5 files match the pattern.

Code sample using the PowerShell Get-ChildItem cmdlet:

 
$userID = "TestUser"
$allContents = gci "\\Server\share" -Recurse | where { $_.Name -like "*$($userID).*" }

This command was taking 3-5 minutes to run, which is unacceptable for something that must be rerun continuously.

On examining this code, the problem is that it first builds a collection of 32K+ "FileInfo" objects and only then filters the results with Where-Object on the pipeline.

Rewriting the PowerShell command like this improved performance dramatically:
# Generic form, using the $userID variable defined above:
$allContents = gci "\\server\share" -Filter "*$($userID).*" -Recurse
# The same call with a literal path and pattern:
$allContents = gci "\\nw-svc-bu01\vol1\WSINFO" -Filter "*SINESC.*" -Recurse



This command returns results in 10 seconds. With -Filter, the pattern is handed to the file system provider, so the command 'skims' the directory structure and only returns files that match the filter. The more files that match the filter, the slower it will be, because it still returns an array of "FileInfo" objects. So creating the "FileInfo" objects is the slow point, not scanning the folders recursively.
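To check the difference on your own machine, here is a minimal timing sketch. It is not the original test: it builds a small throwaway tree under the temp folder (the folder and file names are made up for illustration), so the timings will be far smaller than on a 32,000-file share, and the gap may not be visible at this size. Point $root at a real share to see a realistic comparison.

```powershell
# Hypothetical test tree under the temp folder; substitute your own share path.
$root = Join-Path ([IO.Path]::GetTempPath()) ("gci-demo-" + [Guid]::NewGuid().ToString("N"))
$userID = "TestUser"
foreach ($i in 1..5) {
    $dir = Join-Path $root "folder$i"
    New-Item -ItemType Directory -Path $dir -Force | Out-Null
    # Four non-matching files and one matching file per folder
    foreach ($j in 1..4) { Set-Content -Path (Join-Path $dir "other$j.txt") -Value "x" }
    Set-Content -Path (Join-Path $dir "$userID.log") -Value "x"
}

# Approach 1: create a FileInfo for every file, then filter on the pipeline
$sw = [Diagnostics.Stopwatch]::StartNew()
$viaWhere = @(Get-ChildItem $root -Recurse | Where-Object { $_.Name -like "*$($userID).*" })
$sw.Stop()
"Where-Object: {0} ms, {1} matches" -f $sw.ElapsedMilliseconds, $viaWhere.Count

# Approach 2: pass the pattern to the provider with -Filter
$sw = [Diagnostics.Stopwatch]::StartNew()
$viaFilter = @(Get-ChildItem $root -Recurse -Filter "*$($userID).*")
$sw.Stop()
"-Filter:      {0} ms, {1} matches" -f $sw.ElapsedMilliseconds, $viaFilter.Count

# Clean up the throwaway tree
Remove-Item $root -Recurse -Force
```

Both approaches should report the same number of matches; only the elapsed time should differ.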

Code sample using a .NET class:

 
# System.IO.Directory lives in mscorlib, which PowerShell already loads, so no assembly load is needed.

$userID = "TestUser"
# @(...) collects the enumerated paths into an array of strings (paths containing spaces stay intact)
$allContents = @([IO.Directory]::EnumerateFiles("\\server\share", "*$($userID).*", [System.IO.SearchOption]::AllDirectories))


This command returns results in 7 seconds. The EnumerateFiles method returns a System.Collections.Generic.IEnumerable of strings, a collection that contains only the full path of each matching file.
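If you later need more than the path for the handful of files that matched, one option is to wrap only those paths in FileInfo objects. The sketch below assumes a throwaway folder under the temp directory (the folder and file names are hypothetical) in place of the real share.

```powershell
# Hypothetical test folder; note the space in the subfolder name.
$root = Join-Path ([IO.Path]::GetTempPath()) ("enum-demo-" + [Guid]::NewGuid().ToString("N"))
$userID = "TestUser"
$sub = Join-Path $root "sub dir"
New-Item -ItemType Directory -Path $sub -Force | Out-Null
Set-Content -Path (Join-Path $sub "$userID.txt") -Value "hello"
Set-Content -Path (Join-Path $sub "other.txt") -Value "hello"

# EnumerateFiles yields plain strings: the full path of each matching file
$paths = @([IO.Directory]::EnumerateFiles($root, "*$($userID).*", [IO.SearchOption]::AllDirectories))

# Create FileInfo objects only for the matches, not for every file in the tree
$infos = @(foreach ($p in $paths) { New-Object System.IO.FileInfo -ArgumentList $p })

foreach ($info in $infos) {
    "{0} ({1} bytes)" -f $info.FullName, $info.Length
}

# Clean up the throwaway folder
Remove-Item $root -Recurse -Force
```

Because each path is kept as a whole string, a folder name containing a space comes through unchanged.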

So if you need performance and only need the file paths returned by your query, the .NET classes appear to be slightly faster. It also shows that how you call PowerShell cmdlets can affect performance dramatically.
