Topic: Here I am... I begin with a question (Read 885 times)
Bill Cosby is my Father
Posts: 4
28 credits Members referred : 0
« on: Jul 18, 2007, 12:24:29 AM »
Hi everybody, I'm new to this forum... And I immediately begin with a question.
Do any web crawlers follow links inside documents in non-HTML formats? In particular, I'm interested in PDF documents (possibly containing hyperlinks) and DjVu (déjà vu, again with links in them). PDF is a widespread standard, so maybe...
Does anyone know for sure?
Anyway, I'm experimenting: some of my PDF papers at http://bodrato.it/papers/ contain links to pages that are not linked anywhere else... I'm waiting to see if any crawler reads them...
Let me know, Marco
I am a metal monkey!
Administrator Community Supporter?
Jedai Sword Master
Posts: 8037
41179 credits Members referred : 3
« Reply #1 on: Jul 18, 2007, 12:38:11 AM »
Hi Marco, and welcome to our community.
I think you need to extract the information from the PDF in another way. There are some classes that can read PDF files (which won't really help) and others that convert PDF documents to HTML, which makes it easy to get the links from there.
I don't remember any of them offhand, but I would suggest taking a look at phpclasses.org. If you can't find anything, let us know.
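For what it's worth, here is a rough sketch of the "get the links out of the PDF" idea, written in Python rather than with a class from phpclasses.org. It is not a real PDF parser: it only scans the raw bytes for `/URI (...)` entries, so it assumes the link annotations are stored uncompressed, which is not true of every PDF creator. The function name `extract_pdf_uris` is made up for this example.

```python
import re
import sys

# Matches "/URI (target)" entries as they appear in uncompressed link annotations.
# Nested, unescaped parentheses inside the target are not handled.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")


def extract_pdf_uris(data: bytes) -> list[str]:
    """Return link targets found in raw PDF bytes (uncompressed annotations only)."""
    uris = []
    for match in URI_RE.finditer(data):
        raw = match.group(1)
        # Drop the backslash from escaped characters such as "\(" and "\)".
        # Octal escapes and other PDF string escapes are not decoded.
        raw = re.sub(rb"\\(.)", rb"\1", raw)
        uris.append(raw.decode("latin-1"))
    return uris


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python pdf_links.py paper.pdf
    with open(sys.argv[1], "rb") as handle:
        for uri in extract_pdf_uris(handle.read()):
            print(uri)
```

If this prints nothing for a PDF that visibly contains links, the annotations are probably compressed, and a proper PDF library would be needed instead.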
Global Moderator Community Supporter?
Jedai Sword Master
Posts: 6309
38674 credits Members referred : 374
It's time to use PHP5!
« Reply #3 on: Jul 18, 2007, 09:38:21 AM »
Sure, PDFs are indexed by Google, and the hyperlinks might be noticed (depending on the PDF creator), but these links don't have the same "power" as a normal link.
As Nick suggested, convert the PDFs to HTML. There are a lot of online tools; try a simple Google search, or copy and paste the text (to create clean HTML).
« Reply #4 on: Jul 19, 2007, 09:36:56 AM »
I know PDFs are indexed. I was wondering whether hyperlinks inside them are followed. Anyway, this morning Googlebot gave me the answer: it visited a link that I had inserted in a PDF (and only there).
On the other hand, it seems that all search engines ignore .djvu files (http://djvu.org/).
PS: I do not need any tool to convert PDF to HTML, because I have the LaTeX source of my documents and could generate web pages directly. But those documents are mathematical articles (http://bodrato.it/papers/), much easier to read on paper than on screen.
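For anyone who wants to repeat the check, the Googlebot visit mentioned above is the kind of thing that shows up in the web server's access log. Below is a minimal sketch, assuming the log is in the Apache "combined" format; the path `/hidden/page.html` and the function name `crawler_hits` are hypothetical. Note that it trusts the User-Agent string, which anyone can fake.

```python
import re

# One entry of an Apache "combined" access log (an assumption about the server setup).
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[([^\]]+)\] "(?:GET|HEAD) (\S+) [^"]*" \d{3} \S+ "[^"]*" "([^"]*)"'
)


def crawler_hits(lines, path, bot="Googlebot"):
    """Return the timestamps of requests for `path` whose User-Agent mentions `bot`."""
    hits = []
    for line in lines:
        match = LOG_RE.match(line)
        if match is None:
            continue  # skip lines in another format
        timestamp, requested, agent = match.groups()
        if requested == path and bot in agent:
            hits.append(timestamp)
    return hits


if __name__ == "__main__":
    # Hypothetical log line, only to show the expected input shape.
    sample = [
        '66.249.66.1 - - [19/Jul/2007:07:12:03 +0200] "GET /hidden/page.html HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
    ]
    print(crawler_hits(sample, "/hidden/page.html"))
```

If the page is linked only from the PDF, any hit returned here suggests that the crawler found the link inside the PDF.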
Global Moderator Community Supporter?
Jedai Sword Master
Posts: 6309
38674 credits Members referred : 374
Quote: "PS: I do not need any tool to convert PDF to HTML, because I have the LaTeX source of my documents and could generate web pages directly. But those documents are mathematical articles (http://bodrato.it/papers/), much easier to read on paper than on screen."
Then provide the visitor with both versions; most sources of good content offer both a web and a print version.
« Reply #6 on: Jul 19, 2007, 12:12:48 PM »
Interesting. I didn't know that Google follows links in PDF documents. The next experiment would be to see whether those links help ranking, but this would also be impossible to prove.
Quote: "Interesting. I didn't know that Google follows links in PDF documents. The next experiment would be to see whether those links help ranking, but this would also be impossible to prove."
I'm mainly interested in being indexed, because my research field is quite specialized... My page rank is not at the top, but it is high enough for my pages to be found by people searching for the keywords of my research, "Toom Cook".
Anyway, the experiment is possible... but it requires too much time, given how little I care about the answer.
You should:
make two identical pages on two different (otherwise empty) sites
make two "identical" documents, one in HTML and one in PDF, each pointing to one of those pages, in the same directory on a third server
link to both the HTML and the PDF from about ten pages
wait a month or so...
search for keywords from the two identical pages... and check which one comes first
Then you should check again with other keywords, and again after another couple of months... to have evidence that holds up against "noise".
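One possible way to judge the "noise" in the last step, sketched in Python: record, for each keyword and each check, which of the two target pages ranks first, then run a simple two-sided sign test on the counts. The sign test is my suggestion for summarizing the results, not part of the procedure above, and the function names and the numbers in the usage part are invented for illustration.

```python
from math import comb


def count_wins(observations):
    """Count how often each page ranked first.

    `observations` is a list of (rank_of_html_linked_page, rank_of_pdf_linked_page)
    pairs, one per keyword and check; a lower rank means a better position.
    Ties are ignored.
    """
    html_wins = sum(1 for html_rank, pdf_rank in observations if html_rank < pdf_rank)
    pdf_wins = sum(1 for html_rank, pdf_rank in observations if pdf_rank < html_rank)
    return html_wins, pdf_wins


def sign_test_p_value(wins_a, wins_b):
    """Two-sided sign test: chance of a split at least this lopsided
    if both pages were equally likely to rank first."""
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Invented rank pairs, only to show the input shape.
    observations = [(3, 5), (2, 4), (7, 6), (1, 2), (4, 9)]
    html_wins, pdf_wins = count_wins(observations)
    print(html_wins, pdf_wins, sign_test_p_value(html_wins, pdf_wins))
```

A small p-value would suggest that the difference between HTML links and PDF links is not just noise, although with only a handful of keywords the test has little power.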