Returned to site scraping with .NET for a bit. Library status (there are some engine improvements):
HAP - https://htmlagilitypack.codeplex.com/ - dead (development stopped in 2012, NuGet package is dated 2014 AFAIR)
Tidy - rulez! After a period of inactivity:
- no commits at SourceForge (or wherever it was) for years
- Google Code Hosting had some fork "to support modern features and HTML5", but I guess nothing was developed, and https://code.google.com/ went dead together with many other Google products
https://en.wikipedia.org/wiki/List_of_Google_products#Discontinued_products_and_services
we now have http://www.html-tidy.org/ with current updates
.NET "tidy" stuff looks like a mess, as usual. I would suggest that repositories and projects not updated for more than 2 years should be excluded from default search results.
Googling for something like "tidy .net" finds SourceForge/CodePlex projects such as complete rewrites of (Lib)Tidy in .NET and some obscure wrappers around unknown engines.
Even the Tidy-only results (apart from wrong keyword matches) at https://www.nuget.org/packages?q=Tidy can be confusing. I chose the 2 freshest ones, TidyManaged and TidyHTML5Managed, and decompiled the assemblies. After a small de<t/f>ective investigation, the latter looks better, as it ships with libtidy.dll.
Well, the 2nd lib's author even provided enough NuSpec info, including a github.io link, while the 1st one can be googled at github.com. Why can't TidyHTML5Managed be found at github.com, and what is the f*cking difference between the two sites? The top Google result is https://github.com/blog/1452-new-github-pages-domain-github-io, but honestly, I don't understand why searching for github.io stuff doesn't work.
As a result: I'm using TidyHTML5Managed and manually copying a "working" libtidy.dll (AFAIR, the x64 version bindings failed on my 64-bit Win7). Additionally, valid (X)HTML data-xxx attributes could not be parsed when they had complex values like 'some html tags and attributes with "values"', and all those low-level malloc and P/Invoke operations look imperfect, producing output strings that end with hundreds of '\0'.
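For the trailing '\0' problem, one workaround is to cut the output at the first NUL before using it. Below is a minimal, language-neutral sketch in Python, not the library's own code: the helper name decode_native_buffer is made up for illustration, the 64-byte buffer only simulates what a native call might hand back, and it assumes UTF-8 output (with UTF-16, NUL bytes would appear inside the text itself and this would not work).

```python
import ctypes


def decode_native_buffer(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a NUL-padded native buffer, keeping only the bytes before the first NUL.

    Hypothetical helper for illustration; assumes a single-byte-safe encoding
    such as UTF-8, where a zero byte never occurs inside real text.
    """
    end = raw.find(b"\0")
    if end != -1:
        raw = raw[:end]
    return raw.decode(encoding)


# Simulate a 64-byte native output buffer that holds a much shorter document.
buf = ctypes.create_string_buffer(b"<p>hi</p>", 64)

# Decoding the whole allocation keeps the padding: text followed by NULs.
padded = buf.raw.decode("utf-8")

# Stopping at the first NUL gives just the text.
clean = decode_native_buffer(buf.raw)
```

On the .NET side, the rough equivalent would be calling TrimEnd('\0') on the returned string, or taking the substring up to the first '\0'.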