Saving complete web pages (with images) as files, not archive


timb
2006-11-28, 06:06 PM
Please add the ability to save the complete page (with images) as separate files (an HTML file plus images etc., maybe in a subfolder), like Firefox, Opera and others do (by adjusting the HTML code that points to the images etc.). No PDF, no web archive, just something that's compatible with every browser.

See the thread http://forums.omnigroup.com/showthread.php?t=2275.

Tim B.

Forrest
2006-11-28, 06:52 PM
I just tried it in Firefox, and it doesn't work. If the page is very simple, it probably works. I rarely speak in absolutes, but I would say it's impossible for such a feature to be reliable.

zottel
2006-11-29, 11:13 AM
Why should it be so difficult? You only have to recode all tags that point to resources used in the page so that they reflect the relative path to the file on the hard disk and, to be sure that links still work, recode all links that point away from the page from relative to absolute. That's all, or am I missing something?
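
For illustration, a rough sketch of that recoding in Python (the recode helper and the local_files map are invented for the example; a real implementation would use a proper HTML parser rather than a regex):

    import re
    from urllib.parse import urljoin

    def recode(html, base_url, local_files):
        # local_files maps absolute resource URLs to local paths,
        # e.g. {"http://example.com/logo.png": "page_files/logo.png"}
        def fix(match):
            attr, url = match.group(1), match.group(2)
            absolute = urljoin(base_url, url)  # resolve relative references
            if absolute in local_files:
                # downloaded resource: point the tag at the local copy
                return '%s="%s"' % (attr, local_files[absolute])
            # anything else: recode the link from relative to absolute
            return '%s="%s"' % (attr, absolute)
        return re.sub(r'\b(src|href)="([^"]+)"', fix, html)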

The only difficulties I can imagine are deciding what to call the HTML file if the source was dynamically generated, links that might no longer work because they contained session information, and problems with Flash or other embedded content that might load other resources when it runs in the browser.

But these problems apply to the generation of .webarchives, too.

Remember, we're talking about a single web page here, not about some part of the file tree of an entire website.

wget (a command-line website-downloading tool) has been able to do this since I first used it, more than five years ago.
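
For reference, with GNU wget the invocation would be something like the following: -p/--page-requisites fetches the images and stylesheets the page needs, and -k/--convert-links rewrites references to downloaded files into relative local paths and all other links into absolute URLs, which is exactly the recoding described above.

    wget --page-requisites --convert-links http://example.com/somepage.html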

Forrest
2006-11-29, 12:43 PM
Why should it be so difficult? You only have to recode all tags that point to resources used in the page so that they reflect the relative path to the file on the hard disk and, to be sure that links still work, recode all links that point away from the page from relative to absolute. That's all, or am I missing something?

That's all? Sure, but that's a lot to ask.

The only difficulties I can imagine are deciding what to call the HTML file if the source was dynamically generated, links that might no longer work because they contained session information, and problems with Flash or other embedded content that might load other resources when it runs in the browser.

And a lot of sites, certainly most of the popular ones, will suffer from those problems. If the site has Flash that references or links to any files, you can pretty much bet it will break. A lot of JavaScript and CSS also broke in my tests saving from Firefox 2 and IE7, typically leaving the page badly mangled.

But these problems apply to the generation of .webarchives, too.

In my tests, that's not the case. Sites that broke when saved as complete HTML from IE7 and FF2 did not break when saved as a webarchive.

Remember, we're talking about a single web page here, not about some part of the file tree of an entire website.

I must be completely missing what you're trying to say with that. Saving the source would be talking about a single web page, but saving it as a "complete" page is most certainly trying to save a part of the file tree of an entire site.

wget (a command-line website-downloading tool) has been able to do this since I first used it, more than five years ago.

I haven't used that, but I would be seriously surprised if it didn't suffer from the same issues that IE7 and FF2 do.

timb
2006-11-29, 03:50 PM
I can't say that I've had many problems saving pages that way (though, admittedly, that was on Windows over the past several years). There were occasional glitches (very rare), but one browser or another would always save a given page completely. The only thing slightly damaged might have been the layout of the page, and I can live with that. I religiously keep my notes in plain-text files and my huge archive of web pages in highly compatible single HTML files together with their adjacent files.

And as for archives: there's not much difference between a folder structure and the internal structure of an archive, or am I wrong about that?

TB

Forrest
2006-11-29, 03:54 PM
This has probably worked well in the past, but as sites adopt newer techniques, it's going to become less and less reliable.

zottel
2006-11-30, 03:47 PM
That's all, yes: I don't think that's so much. Not much more than 30 lines in Perl, I'd guess. (Quite a bit more than that in C, of course, but compared to the complexity of a whole browser, that's peanuts, IMHO.)

But, OTOH, yes: I haven't used wget for years, and back then even CSS wasn't very widespread. I don't know whether wget can handle JavaScript or CSS stuff. And there's no way it can handle Flash. ;-)

And about .webarchives: it really depends on how they work. Maybe they actually put the browser back into the state it was in when you viewed the page, so relative links would still point to the correct destination without having to be recoded. That would make things much easier, of course. If that format is some kind of on-disk version of the internal model the browser uses, nothing would have to be changed; if not, at least the tags pointing to the resources used in the page would have to be recoded in some way. As well as it works, though, I guess it's really more a matter of saving the browser state than of saving recoded pages.

Regarding the file tree: I thought you were maybe thinking of something like sucking down a whole forum for offline viewing, which is much more difficult, of course: deciding what depth to use to be sure that everything you need is there, etc.

But I agree that with all the new techniques it has become quite difficult to save a representation of what you're viewing as some kind of source tree. As long as there's no JavaScript or other dynamic stuff included, it's not really a big problem. It's probably impossible with AJAX stuff, though.

Forrest
2006-11-30, 04:17 PM
So I'm curious why the results need to be in HTML/CSS rather than, for example, a PDF. The only difference I can think of is the ability to copy or edit the code.

zottel
2006-11-30, 04:38 PM
And as for archives: there's not much difference between a folder structure and the internal structure of an archive, or am I wrong about that?

As I said in the post above, I guess that .webarchives are in fact some representation of the internal model of the browser. That means that when a .webarchive is loaded, the browser is put into exactly the same state it was in when you were actually viewing the page. This way, several problems can be avoided. Above all, the browser effectively still sits in the same server directory, so all relative references, be they in images or links or JavaScript or Flash animations, will still point to the correct destination without changing anything. Additionally, any dynamic content, even if it's ajaxly dynamic, ;-) will still have the same representation it had when you were actually viewing the page. It would be extremely difficult, if not impossible, to achieve this by translating that stuff into actual files and still be able to interact with it when you view it again (like moving a map on maps.google.com).

Edit: Of course, interactivity will also break with .webarchives if the page has changed in the meantime. If Google decides to use some other JavaScript model for moving maps, your old .webarchive will still show the same thing as before, but you won't be able to move the map anymore.

zottel
2006-11-30, 04:49 PM
So I'm curious why the results need to be in HTML/CSS rather than, for example, a PDF. The only difference I can think of is the ability to copy or edit the code.

Well, .webarchives only work with WebKit browsers. And PDFs don't let you interact with the page, e.g. follow links (that would be possible in principle, but I doubt it's implemented that way; I've never tried).

If an HTML/CSS version were possible for any page you viewed, you could open your archives in any browser whatsoever.

I guess that's what's behind that request.

zottel
2006-11-30, 05:14 PM
... which brings me to another question:

Does anyone know more about .webarchives and how that format is actually defined? If it's really a representation of the internal browser model, will it work with future versions, where this model might change?

Forrest
2006-11-30, 10:38 PM
I gotcha. I did some searching for more info on Webarchives, and I did find one app that will extract files from a webarchive. Not sure how well it works. http://www.macupdate.com/info.php/id/20643

Len Case
2006-12-01, 01:00 AM
A webarchive is a serialized form of the record of responses used to create a webpage.

Basically, as each resource is requested (via an image tag, a subframe, JavaScript, or even a Flash plug-in request) and the response comes back from the server, the request-response pair is stored in an object that can be serialized as a data file (the webarchive). Then, when the webarchive is loaded and each resource reloads, any request that matches one in the archive is answered from the archive instead of going to the server.

It doesn't actually store the full state of JavaScript or plug-ins (hard to do in the first case, and not part of the API in the second).
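
To make that concrete: on disk, a .webarchive is a binary property list, so the stored request/response pairs can be inspected with an ordinary plist reader. A short Python sketch (the WebMainResource, WebSubresources, WebResourceURL, WebResourceMIMEType and WebResourceData keys are WebKit's; treat this as an illustration, not a complete parser):

    import plistlib

    with open("page.webarchive", "rb") as f:
        archive = plistlib.load(f)  # webarchives are binary plists

    main = archive["WebMainResource"]
    print("main resource:", main["WebResourceURL"], main["WebResourceMIMEType"])

    # each subresource is one stored request/response pair
    for res in archive.get("WebSubresources", []):
        print("subresource:", res["WebResourceURL"],
              len(res["WebResourceData"]), "bytes")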

Len Case
2006-12-01, 01:06 AM
... which brings me to another question:

Does anyone know more about .webarchives and how that format is actually defined? If it's really a representation of the internal browser model, will it work with future versions, where this model might change?
Since all of WebKit is now open source, you can look at the code for yourself and see exactly how webarchives are defined and created; and since the full history is in the public repository, you should always be able to read or write any version of webarchive were they to change the format in the future.

timb
2006-12-01, 06:37 PM
Well, I'm back and...

...not only did I cough up the 9.95 for the November-sale OmniWeb, even though I don't have a Mac that can run the latest version (I have a soft spot for this browser, dunno why)...
I just tried it in Firefox, and it doesn't work. If the page is very simple, it probably works. I rarely speak in absolutes, but I would say it's impossible for such a feature to be reliable.
... I also dusted off my b/w G3 (running Jaguar, which is why I can't run OW 5.5) and saved more than half a dozen web pages (with images) in Firefox (0.9!), transferred the folders and files to a Windoze machine, and looked at them (while offline) in IE 6 and others. All but one (my Gmail inbox; I didn't seriously expect that to save correctly) showed up with the content intact. This included the Omni Group homepage and the OmniWeb features page. The page layout sometimes wasn't reproduced exactly like the original, but I don't care about that. And I know that Opera would have saved it even better.

The main reason for the request was cross-platform and future accessibility. It's the same reason why I prefer to keep my notes in plain text. Web archives are a joke. PDFs are something completely different from the page itself. I'm used to digging into the source code of pages I've saved and adding remarks or making other adjustments (I sometimes even run search-and-replace operations to correct errors; these are not all-English pages, after all). I can't do that with PDFs. And if I try to save/print as an A4 PDF, more often than not the margins of the page will be cut off. I had actually started out trying to archive everything as PDF, but soon abandoned that approach.
So I want to uphold my request: Please add a feature that saves web pages as individual files together with their adjacent images etc.

T.

DanielSmith
2007-03-10, 10:21 PM
I know an easy way to do that, LOL:
just save the entire page as an image.
Using the system Print Screen key is not a good idea, since it can only capture what's on the screen. I'm using ACA Capture (http://www.acasystems.com/en/screencapturepro/), which can also capture the parts of the web page outside the screen.
But once the web page is saved as an image, it can't be taken apart anymore.

timb
2007-03-25, 05:04 PM
DanielSmith, in what way would that be better than saving as PDF?

AFAICT, saving as PDF would do this just as well, but the major points in this thread were that:

- web archives are proprietary to WebKit browsers and not cross-platform (there aren't any non-Mac WebKit browsers)
- while PDFs are kind of cross-platform:
  - they don't preserve the links in the pages
  - I like to edit the source code of saved pages (to add comments or correct errors, even to edit links to additional images etc.), which PDFs don't allow
  - some web pages don't print well at all, and some PDF "printouts" have margins that cut off text, etc.

JKT
2007-03-26, 10:48 AM
(timb, this won't help with your needs, but I'm posting it as an FYI.)

If you Save as PDF (hold Option down as you do a Save As...), rather than printing to PDF, the page should save with its formatting preserved. Note that this method generates a single-page PDF of the site, so content that would overflow a sheet of paper in the print version doesn't get split. It's useful for pages that need a lot of vertical scrolling, if you don't want them broken at inconvenient places in the text. However, it isn't so useful if you actually want a hard-copy printout.

I'm hoping Apple will allow the links to remain live in their PDFs in the next version of OS X.

Chiller
2007-03-31, 06:11 PM
An ambitious (and knowledgeable) person could write an AppleScript to do that, provided OmniWeb could list all of the resources on a web page like Firefox does, and then save all of those files into an archive (an Apple zip file).