7, September 2008

Here I am... I begin with a question - webmaster forum

 
Webdigity webmaster forums
Bill Cosby is my Father
« on: Jul 18, 2007, 12:24:29 AM »

Hi everybody, I'm new to this forum... and I'll begin right away with a question.

Do any web crawlers follow links inside documents in non-HTML formats?
In particular I'm interested in PDF documents (possibly containing hyperlinks) and DjVu (déjà vu, again with links in them). PDF is a widespread standard, so maybe...

Does anyone know for sure?

Anyway, I'm experimenting: some of my PDF papers at http://bodrato.it/papers/ contain links to pages that are not linked anywhere else... I'm waiting to see whether any crawler reads them...

Let me know,
Marco
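
One way to watch for the outcome of an experiment like this is to scan the web server's access log for crawler requests to the page that is linked only from the PDF. The sketch below is illustrative and rests on assumptions not stated in the thread: the log is in the common Apache/Nginx "combined" format, the path `/hidden/only-in-pdf.html` is a made-up placeholder, and the crawler list is a small, non-exhaustive sample. User-agent strings can be spoofed, so a hit is a hint rather than proof.

```python
import re

# Assumed log format: Apache/Nginx "combined" access log lines.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Lower-case substrings that identify some crawlers (illustrative only).
CRAWLER_MARKERS = ("googlebot", "bingbot", "slurp", "msnbot")


def crawler_hits(lines, hidden_path):
    """Return (timestamp, crawler marker) for each crawler request to hidden_path."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line)
        if not m or m.group("path") != hidden_path:
            continue  # unparsable line, or a request for some other page
        agent = m.group("agent").lower()
        for marker in CRAWLER_MARKERS:
            if marker in agent:
                hits.append((m.group("time"), marker))
                break
    return hits


# Hypothetical sample data; a real run would iterate over the log file instead.
SAMPLE_LOG = [
    '66.249.66.1 - - [19/Jul/2007:06:12:01 +0200] '
    '"GET /hidden/only-in-pdf.html HTTP/1.1" 200 1234 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.7 - - [19/Jul/2007:06:15:44 +0200] '
    '"GET /papers/ HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux)"',
]

for when, who in crawler_hits(SAMPLE_LOG, "/hidden/only-in-pdf.html"):
    print(when, who)
```

Matching on the exact path keeps the check strict: only a request for the page that is referenced solely from the PDF counts as evidence that the link was followed.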
I am a metal monkey!
Administrator
« Reply #1 on: Jul 18, 2007, 12:38:11 AM »

Hi Marco, and welcome to our community :)

I think you need to extract the information from the PDF in another way. There are some classes that can read PDF files (which won't really help) and others that convert PDF documents to HTML, which makes it easy to get the links out.

I don't remember any of them by name, but I suggest taking a look at phpclasses.org. If you can't find anything, let us know :)
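
For a quick look at which links a PDF carries, a full conversion to HTML is not always needed. The following is a rough sketch, not a real PDF parser: it assumes the link annotations are stored uncompressed as `/URI (...)` literal strings, which is common in older PDFs but not guaranteed. Links held in compressed object streams or written as hex strings will be missed. The function name `pdf_uris` is made up for this example.

```python
import re

# A URI action looks like: /S /URI /URI (http://example.org/)
# The group accepts escaped characters (\( \) \\) inside the literal string.
URI_ENTRY = re.compile(rb'/URI\s*\(((?:\\.|[^\\()])*)\)')


def pdf_uris(data: bytes):
    """Return the URI strings found in raw (uncompressed) PDF bytes."""
    uris = []
    for m in URI_ENTRY.finditer(data):
        # Undo the backslash escapes for parentheses and backslashes.
        raw = re.sub(rb'\\([()\\])', rb'\1', m.group(1))
        uris.append(raw.decode("latin-1"))
    return uris


# Hypothetical fragment of a PDF link annotation, used as sample input.
SAMPLE = (b"<< /Type /Annot /Subtype /Link "
          b"/A << /S /URI /URI (http://bodrato.it/papers/) >> >>")

print(pdf_uris(SAMPLE))
```

To run it on a real file, read the bytes with `open(path, "rb").read()` and pass them to `pdf_uris`; an empty result does not prove the PDF has no links, only that none were stored in the simple form this sketch looks for.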

What's HTML?
« Reply #2 on: Jul 18, 2007, 12:46:11 AM »

Welcome to the Webdigity forums, Marco.

I am not familiar with PDF-related issues, sorry...


Global Moderator
« Reply #3 on: Jul 18, 2007, 09:38:21 AM »

Sure, PDFs are indexed by Google, and the hyperlinks might be noticed (it depends on the PDF creator), but these links don't have the same "power" as a normal link.

As Nick suggested, convert the PDFs to HTML. There are a lot of online tools; try a simple search on Google, or copy and paste the text (to create clean HTML).

BTW, welcome to webdigity.com :)


Bill Cosby is my Father
« Reply #4 on: Jul 19, 2007, 09:36:56 AM »

I know PDFs are indexed. I was wondering whether hyperlinks inside them are followed. Anyway, this morning Googlebot gave me the answer: it visited a link that I had inserted in a PDF (and only there).

On the other hand, it seems that all search engines ignore .djvu files (http://djvu.org/).

PS: I do not need any tool to convert PDF to HTML, because I have the LaTeX source of my documents and could generate web pages directly. But those documents are mathematical articles (http://bodrato.it/papers/), much easier to read on paper than on screen.
Global Moderator
« Reply #5 on: Jul 19, 2007, 11:20:12 AM »


Quote: "PS: I do not need any tool to convert PDF to HTML, because I have the LaTeX source of my documents and could generate web pages directly. But those documents are mathematical articles (http://bodrato.it/papers/), much easier to read on paper than on screen."

Then provide the visitor with both versions; most sources of good content offer both a web and a print version.


I am a metal monkey!
Administrator
« Reply #6 on: Jul 19, 2007, 12:12:48 PM »

Interesting. I didn't know that Google follows links in PDF documents. The next experiment would be to see whether those links help ranking, but that would be impossible to prove :)

Bill Cosby is my Father
« Reply #7 on: Jul 20, 2007, 09:17:10 PM »

Quote: "Interesting. I didn't know that Google follows links in PDF documents. The next experiment would be to see whether those links help ranking, but that would be impossible to prove :)"

I'm mainly interested in being indexed, because my research field is quite specialized... My PageRank is not the highest, but it is high enough for my pages to be found by people searching for the keywords of my research, "Toom Cook".

Anyway, the experiment is possible... but it requires too much time for my level of interest in the answer.

You would have to:
  • make two identical pages on two different (otherwise empty) sites
  • make two "identical" documents, one in HTML and one in PDF, in the same directory of a third server, each pointing to one of the two pages
  • link both the HTML and the PDF document from about ten pages
  • wait a month or so...
  • search for keywords from the two identical pages... and check which one comes up first
Then you should check again with other keywords, and once more after another couple of months, to guard against "noise".

Too long to wait for such a small piece of information...
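
The "noise" concern can be made a little more concrete with a sign test over the keyword checks. This is only a sketch under a strong assumption that the thread does not make: each keyword search is treated as an independent trial in which, if link format had no effect, either page would rank first with probability 1/2. Real search rankings are unlikely to be that independent, so the result is a rough guide at best. The function name and arguments are invented for this example.

```python
from math import comb


def sign_test_p_value(wins_html: int, wins_pdf: int) -> float:
    """Two-sided sign test: how surprising is this split if both pages were equal?

    wins_html: keyword searches where the page linked from the HTML document ranked first
    wins_pdf:  keyword searches where the page linked from the PDF document ranked first
    """
    n = wins_html + wins_pdf
    if n == 0:
        return 1.0  # no trials, no evidence either way
    k = max(wins_html, wins_pdf)
    # Probability of k or more wins out of n for one side under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes, not measured data.
print(sign_test_p_value(5, 5))
print(sign_test_p_value(10, 0))
```

A small p-value after many keyword checks would suggest the difference is not just noise; with only a handful of checks, even a one-sided result says very little, which is in line with the waiting time described above.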

--
http://bodrato.it/toom-cook/binary/