|On philology, potatoes and construction.|
Well, this is just my first approach to blog-writing. I want it to be the way to keep in touch with colleagues and friends.
|Cosine distance and semantic relations|
|Following a previous post that illustrated the meaning of a word based on its context, here is a script to calculate the cosine distance among terms. The underlying idea is that, given enough data, semantic relations can be resolved from context. As a practical application, given a place name, the cosine distance measures the proximity (or distance) of a geographical entity relative to two geographical features used as semantic traits (e.g. to determine which entities are islands).|
In the script linked above I use the data given as a toy example by Baroni, Bernardi, & Zamparelli (2014, pp. 248-250) to introduce the measure in terms of distributional semantics.
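For readers who want to see the measure itself, here is a minimal sketch in Python. It is not the script mentioned above and it does not use the Baroni et al. data: the context words and all the counts are invented for illustration, and the function names are my own.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Distance form of the measure: 0 means same direction, 1 means orthogonal."""
    return 1.0 - cosine_similarity(u, v)


# Hypothetical co-occurrence counts with the context words
# ("sea", "coast", "bridge", "bank"). Invented numbers, for illustration only.
island = [8, 6, 0, 1]
river = [1, 1, 7, 9]
place_name = [5, 4, 1, 1]

# The place name is assigned to whichever reference feature is nearer.
if cosine_distance(place_name, island) < cosine_distance(place_name, river):
    closer_to = "island"
else:
    closer_to = "river"
print(closer_to)
```

With these made-up counts the place name falls on the island side, because its vector points in nearly the same direction as the island vector regardless of how large the raw counts are.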
|Ice on the river|
Once solid soil to step on, ice melts again.
Winter is gone, in the river the water runs.
|How long is a league?|
We are talking about length, of course. Though, even after disambiguation, the answer may change depending on our definition. If we take metres to be the standard unit of length, "a unit of length equal to X metres" could be an acceptable definition. So easy, isn't it? Now we only have to answer the question, how many metres are there in a league?
Here is where we could get lost trying to be more precise again. Conversion values have varied from one place to another since the early periods of history. Furthermore, we would first need to use standard units other than metres, such as stades and feet, and then convert to metres again.
So, here we are, tracking Mendes Pinto with distances given in leagues. The issue is that, after working with a huge bibliography, I haven't found a definitive explanation of how the values were calculated, if any values were given at all. The most secure point I got is that there are 17.5 leagues to a degree, that is, a league is 1/17.5 of a degree. [1]
Having this as a given value, I approached the issue as follows:
- Mendes Pinto reports that measures of distance and position were taken from the sun.
- Now, what Fernão Mendes Pinto really measures is an angle of the Earth's circumference, taking relative positions on its surface with the sun as a point of reference.
- The division of the sphere was well known in this period (Pedro Nunes, Sacrobosco): as nowadays, it has 360 degrees.
- Now, FMP is giving distances as fractions of that circumference: 17.5 leagues to each of its 360 degrees. FMP doesn't need to know how long planet Earth is along the meridian; he just measures parts of it and calls a unit of length along his path a league. This universal unit could then be converted to whatever the local common unit of length was, be it stades, feet, and so on.
- Since we want to know the present-day value, we don't need to work out how many stades or feet a league has either. It is enough to know the value in metres of what FMP measures, that is, the circumference of the Earth, which is 40,000,000 m (a round number, because the metre was originally defined from the distance between the pole and the equator). [2]
- Now, the maths: 40,000,000 m ÷ 360° = 111,111.1 m per degree, which we further divide by 17.5: 111,111.1 m ÷ 17.5 = 6,349 m (6,350 m if the length of Earth's circumference is taken as 40,007,863 m).
Therefore, following this approach, a league is 6.349 km.
Let's check it! According to FMP, the distance from Nanjing to Beijing is 180 leagues. So 180 leagues × 6.349 km = 1,142.82 km. The actual distance, as a round number, is 1,150 km! [3]
More examples are needed to bring actual evidence, but this is close enough to be on the right track!
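The arithmetic above can be replayed in a few lines of Python. This is only a sketch of the same calculation; the function and variable names are my own, and the two inputs (17.5 leagues to a degree, a circumference of 40,000,000 m) are the assumptions stated in the post.

```python
def league_in_metres(circumference_m=40_000_000, leagues_per_degree=17.5):
    """Length of one league, assuming a fixed number of leagues per degree."""
    metres_per_degree = circumference_m / 360
    return metres_per_degree / leagues_per_degree


league = league_in_metres()
print(round(league))                        # 6349
print(round(league_in_metres(40_007_863)))  # 6350

# Nanjing to Beijing, 180 leagues according to FMP.
nanjing_beijing_km = 180 * league / 1000
print(round(nanjing_beijing_km, 1))         # 1142.9
```

The last figure differs slightly from the 1,142.82 km given above only because the text multiplies by the rounded value of 6.349 km.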
[1] Albuquerque, L. (1987). As navegações e a sua projecção na ciência e na cultura. Lisboa: Gradiva. P. 49.
[2] Wikipedia. Sub voce Earth. http://en.wikipedia.org/wiki/Earth
[3] Alves, J. (ed.) (2010). Fernão Mendes Pinto and the Peregrinação. Lisbon: Fundação Oriente. (Notes by M. Ollé, Vol. III, p. 126)
|More on Travels|
Last week I had the privilege of participating in a colloquium again. The topic was the identification of place names in Mendes Pinto's travels.
This is a synopsis:
"We present here our current work on a positivist analysis of place names. It aims to be valuable for either a literary reading or a more strict historical and geographical interpretation of Pinto's work. We sketch three methods to trace a place name.
1) Through phonetic analysis, does a given place name match a geographical entity in an Asian language?
2) By examining context, do we find historical, ethnographic, topographic, architectural, any descriptive features to relate the place name to a particular geographical area?
3) Given a point of reference and a vector of displacement, is it possible to solve the place name through cartography? "
A pdf file with slides is available here
Many miles away, autumn may-trees,
bore sweeter berries, less lobulated leaves.
|Is human language a system made of an infinite number of expressions? (and II)|
Allowing all possible combinations for the whole English vocabulary, using one of the longest sentences ever registered in this language, the calculation initially came out as infinity. As this was the expected result, and being pressed to write my presentation, I sent a draft to Miro Moman, who kindly and quickly pointed out that my last operation actually gives 1.63585694 × 10^23086. This is a very large number indeed (compare with the upper bound of the physical universe), though still finite. All right again, who cares about a number which is bigger than the volume of the physical universe? Isn't that infinite enough?
Well, it is all right not to care that much. As I told you when this blog started six years ago, when we were only 6 billion people on this planet, in this section I deal with subjects that could be of interest to four or five people in the whole of humanity at the present time. We are more than 7 billion now, so the number could have slightly increased... though only by one at most if we keep the proportion we had six years ago.
Yet the issue is that 1.63585694 × 10^23086 is the result of applying maxima that over-represent the combinatory potential of any human language. That is, a closer approximation will always be smaller, even if we apply syntactic rules alone! An easy example: this unrestricted combination would allow the same word to be repeated up to 4,391 times and would still count the resulting string as a sentence.
Let's go small to try to understand better. Let's take the first branch of the Mabinogi from our corpus: 1605 word-types, a very small lexicon; the longest sentence has 64 words (this well represents a high value for sentence length). If we allow all possible combinations, that is, the same word to appear in any position of the sentence, we get 1605^64, still bigger than the volume of the universe! However, you will soon notice that with such a small lexicon the number of grammatical sentences (let alone if we add semantics) must be finite and surely much smaller!
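To get a feeling for how quickly even one restriction shrinks the number, here is a small Python sketch. The unrestricted count is the 1605^64 from the paragraph above. The second count, which forbids any word from appearing twice in the same string, is not a syntactic rule from the post; it is only a stand-in I chose to illustrate that any restriction lowers the maximum.

```python
import math

word_types = 1605       # lexicon size quoted in the post
longest_sentence = 64   # length of the longest sentence, in words

# Every position may hold any word type, repetitions included.
unrestricted = word_types ** longest_sentence

# Illustrative restriction (my own assumption): no word type may repeat.
no_repeats = math.perm(word_types, longest_sentence)

print(len(str(unrestricted)))      # 206 decimal digits
print(no_repeats < unrestricted)   # True
```

Both figures are still enormous, yet both are plain finite integers that Python holds exactly.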
You can tell me, what about recursion, and adding a loop that infinitely embeds a sentence within a sentence (using a complementizer in English, for instance)? ... all right, go on, you can move towards infinity as much as you want, and indeed create the longest sentence ever... though only when you reach an end (a sentence has to be complete to be a sentence) will you have a sentence unit, a single one. More importantly, at that point, even if you got a result bigger than 1.63585694 × 10^23086, it would be finite again.
So, sentences are more similar to the lexicon than I previously thought. As far as I understand, the set of sentences in any human language is finite, boundless like the vocabulary of a language, and very large, though much smaller than the maxima given above. Taking the more manageable example of the Mabinogi corpus as a rough extrapolation, it would be a matter of adding rules to come down to our solar system and begin to get a number at least smaller than the volume of our physical universe.
|Is human language a system made of an infinite number of expressions? (I)|
Last week I had to give a conference about my research.
This was the introduction:
"Human language can be viewed as a system made of a finite number of units from which an infinite number of expressions is generated. Given any particular language, we have a finite and small number of phonemes, the distinctive sound units, that can be combined to form thousands of morphemes, with no upper bound to create new entries in the lexicon. Words themselves are combined to make up an infinite number of sentences.
When we learn a new language, or when we start to learn to speak, we are expected to learn the words that are more widely used first. In corpus linguistics we refer to this statistical property of words as word frequency. Our model focuses on a practical use of words based on their frequency. Words that have a higher frequency, that is, more occurrences in a corpus chosen to represent a particular and real use of a language, appear in a higher number of sentences, hence they would be more useful for understanding more sentences in that language.
In order to grasp the meaning of a sentence, we also need to understand its syntax, the special relations that bind words together. Our model uses a very basic approach to syntax. We attempt to determine approximately which sentences would involve more processing in order to be efficiently parsed, using words as units and sentence length as the quantitative variable. We hence understand a sentence as a discrete variable made up of a number of words (tokens). We consider the number of words to be relevant for sentence complexity, defined here as the higher or lower number of syntactic rules we would need to efficiently parse a sentence. Long sentences involve more syntactic relations, so they are more difficult to parse than shorter ones."
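The two quantities the introduction relies on, word frequency and sentence length in tokens, can be sketched in Python as follows. This is not the model presented at the conference: the sentence splitter and tokenizer are deliberately naive, the sample text is invented, and all names are my own.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-cased word tokens; punctuation is dropped."""
    return re.findall(r"\w+", sentence.lower())


# Invented sample text, for illustration only.
text = "The river froze. The ice melts again and the water runs. Winter is gone."
sentences = split_sentences(text)

# Word frequency: occurrences of each word type in the whole sample.
frequency = Counter(tok for s in sentences for tok in tokenize(s))

# Sentence length in tokens, used here as a rough proxy for parsing effort.
by_length = sorted(sentences, key=lambda s: len(tokenize(s)))

print(frequency.most_common(1))   # [('the', 3)]
print(by_length[-1])              # the longest sentence of the sample
```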
In the process of writing the presentation I began to work on an easy explanation of how, from an initial finite number of phonemes, a countable number of syllables is formed to make up a finite, though boundless, number of words in a language. From there, I wanted to show that the number of sentences was infinite, so I made some basic operations taking maxima as values:
* Leaving obsolete words apart, the total vocabulary in a dictionary is considered relevant for combinatory purposes. Let's take Oxford dictionaries as an example: 171,476 words in current use + around 9,500 derivatives: 180,976 words.
* Only lexemes are considered. Inflectional word-types and derivatives not listed in the dictionary are not counted.
* No grammatical restrictions are applied for the combination of words in a sentence.
* A sentence with an exceptionally high number of words in the English language is taken as the maximum value for the number of words in a sentence: Molly's monologue is 4,391 words (James Joyce).
The picture above shows the result of the final operation.
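For anyone who wants to repeat the final operation, here is a hedged sketch in Python using the two maxima listed above (180,976 words, 4,391 positions). Floating-point arithmetic cannot represent the result, which is the kind of situation in which many calculators simply report infinity; Python's integers are exact, so the power itself can be computed and then summarised through its logarithm. The variable names are my own.

```python
import math

vocabulary = 180_976      # dictionary figure from the list above
sentence_length = 4_391   # words in Molly's monologue

# A double-precision float tops out near 1.8e308, so this overflows.
try:
    float(vocabulary) ** sentence_length
    overflowed = False
except OverflowError:
    overflowed = True

# Python integers have arbitrary precision, so the exact power is fine.
combinations = vocabulary ** sentence_length

# Summarise the huge integer as mantissa x 10^exponent via its base-10 log.
log10_value = math.log10(combinations)
exponent = math.floor(log10_value)
mantissa = 10 ** (log10_value - exponent)

print(overflowed)
print(f"{mantissa:.4f} x 10^{exponent}")
```

The exponent should come out as 23086, a very large but finite number.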
|Seasonable season's greetings|
Seasonable season, the snowy winter,
And long-lasting ice in Ulaanbaatar.
|The last one (at least took part)|
|The UNDL Foundation organised the II UNL Olympiad, an event to promote the use of the Universal Networking Language (UNL). Among other uses, the UNL is intended to serve as a pivot language for a global machine translation system.|
I decided to participate less than a week before the deadline. It took me three days to prepare the basic corpus and a grammar.
The result: to be the last one in the only modality I took part in!
Congratulations to all the winners, silver and bronze medals included.
|More on travels|
While still busy with the edition of the glossary of place names of Mendes Pinto, I got in touch with Anton Jankovoy, an outstanding travel photographer. Several things led me to contact him about editing an article he wrote. First, the fact that I was into publishing at the time. Second, and most important, Anton is a traveler in Asia and his pictures are taken in some of the most fascinating and inaccessible places on Earth. Finally, his work on astronomy brings a clear image of the sky, highlighting some aspects that were key to astronomical navigation. The rotation of the Earth appears vividly in the star trails of Anton's pictures, and so does the concept of the stars as a reference to find the celestial poles and hence define latitude, the only geographic coordinate used by Mendes Pinto on his travels (the combination of latitude with distances in leagues and cardinal directions allowed places to be located without longitude).
This is a short e-book on photography and the night sky. To me it was an excellent opportunity to learn more about topics that are extremely useful to better understand geographical description at the time of transoceanic astronomical navigation.
Google Books offers a good sample of Anton's e-book. Different formats can be found for sale at Lulu, Google Play and Amazon.
As for my e-book on place-names, most of the contents are available on Google Maps. If you have a particular interest in the field or you are a researcher on a related subject, you can contact me to send you a free PDF file with the whole list of place-names I have researched.
I am working on an e-book that sums up all the place names studied at http://goo.gl/iqc3P
Most of the contents are already available on the map; the book is only a revision, with some entries having more content (because of the character limit in Google Maps) and an overall more readable layout.
Below you can find the first draft for an
Have you ever watched a film, read a book, played a game or looked at a picture that made an impression on you although, at the very time you were supposed to enjoy it, you were urged, compelled, simply obliged to move on? Several years have to pass until you find the time and the place. Now, with less urgency, you actually begin to get the most out of what it has to offer; at different moments your attention becomes more intense, until you finally feel bound by an attachment which goes from initial great admiration to the discovery of a new experience far beyond the mere act of watching a film / reading a book / playing a game / looking at a picture?
This is what happened to me with Fernão Mendes Pinto's travels.
I first scanned the Peregrinaçam (The travels of Mendes Pinto) at a time when different books were piled one upon the other, as compulsory readings with fixed deadlines. Not the best way to go into subtle, secondary (or even less than secondary) issues such as place names. Enough, however, to be touched by a book that resembled an adventure film with great sailors, pirates, travelers, warriors, diplomats, merchants, sages, and, above all, exotic places. Enough also to come back, this time just for the pleasure of reading, some years later, during lunch breaks or between shifts and over several winter weekends (Peregrinaçam is a long book to read). And to return once more, and again, as if I needed the story to be continued with new chapters emerging from the vast collection of traveled locations.
This is how my personal quest for the Ilha de Ouro (The Island of Gold, an island very seriously and repeatedly sought during the sixteenth and even seventeenth centuries, probably nonexistent as such) began, and that of Tajampura (a port for the diamond trade), and the land of the Oqueus (hunters reputed to be wild and fierce, and riders of the strangest mounts), and the collection of references that lead to places such as Quangepaarù (the very rich city where the Emperor of Cauchenchina lived most of the year) and Ocumchaleu (an ancient city that, as Pinto reports having been told by a hermit who read the ancient chronicles, sank under the waters of a lake near Sauady). To do so, obviously I had to start with the most evident and secure locations. And not being a geographer, I had to rely on the work of previous and present researchers who were, and are, surely more qualified to achieve the final goal of locating most of the place names mentioned by Mendes Pinto.
When I began to draw my first more methodical maps, two years ago, there were already published works that had undertaken careful research on the place names cited in Peregrinaçam. The studies of Visconde da Lagoa and Le Gentil set the starting point. Both works are rare nowadays even in specialized libraries; in fact, I couldn't consult either of them, except for the maps originally published by Visconde da Lagoa, edited as a supplement in a later edition of Pinto's works (Pinto 1989). Subsequent research either treats the subject more superficially, or covers partial geographies only, or, as in the case of the most complete study I have ever found on Peregrinaçam (Alves 2010), does not try to locate all the place names on a map.
Hence the need to create my own cartography. And that is the main contribution of this work since, as for the contents, most of the data presented here is only a compilation of information already published. If this approach has any merit, it is the result of simply putting together different references to discuss each particular place name, seeing where, and if, the authors agree, and finally drawing up a list with the most probable and secure locations.
In some extraordinary cases, when I found enough resources or had reached more secure / probable related locations, I dare to identify or approximately locate new place names that I haven't found in any previous work.
More often, when authors hold different opinions and there is enough evidence in Mendes Pinto's explanations, I also analyze the data and offer one location as more probable than the others.
Now, my main aim in publishing this list of place names and its related map is to offer a new tool for the present-day traveler, researcher, archeologist, sailor, or whoever may be interested, to confirm the less probable locations, to identify new place names, or just to serve as a guide to follow Mendes Pinto's travels with the visual aid of a map.
May you find it as stimulating as I did.
|Toponymy of the world|
Brief overview of the identification of placenames in Mendes Pinto's Travels.
The first step was indexing all placenames. This was done by hand. I also wrote a machine extraction algorithm that, for now, does not differentiate anthroponymic entries from placenames (it is based on typography only). Next came documentation and finally mapping.
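As an illustration of what a typography-only extraction can and cannot do, here is a minimal Python sketch. It is not the algorithm mentioned above, whose details are not given here: I am assuming that "typography" means capitalisation, the function name is hypothetical, and the sample sentence is invented.

```python
import re

SENTENCE_END = {".", "!", "?"}


def candidate_names(text):
    """Collect capitalised words that do not open a sentence."""
    tokens = re.findall(r"\w+|[.!?]", text)
    candidates = []
    sentence_start = True
    for tok in tokens:
        if tok in SENTENCE_END:
            sentence_start = True
            continue
        # A capital letter in mid-sentence is taken as a sign of a proper name.
        if tok[0].isupper() and not sentence_start:
            candidates.append(tok)
        sentence_start = False
    return candidates


# Invented sample sentence, for illustration only.
sample = "We sailed from Tajampura to the lake near Sauady. There Antonio de Faria waited."
print(candidate_names(sample))   # ['Tajampura', 'Sauady', 'Antonio', 'Faria']
```

The output mixes place names with a person's name, which is exactly the limitation described: capital letters alone cannot tell the two apart.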
The main goal is always to find the closest possible geographic coordinates for a given placename. Homophony (the fact that a name sounds similar to a placename in another language) is the main means of identification, though not the most important one when other parameters can be considered: distance (leagues mainly; less frequently, latitude degrees), direction, time spent traveling from already identified locations; historical and cultural references; landscape and urban description, and overall geographic context.
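When distance and direction from an already identified location are available, a first estimate of the coordinates can be sketched with the standard destination-point formula on a sphere. The Python below is only a sketch under stated assumptions: the Earth is treated as a perfect sphere, the route as a great-circle track (a course held on a constant compass bearing would be a rhumb line, so this is an approximation), and the conversion of 17.5 leagues to a degree is the figure discussed elsewhere on this blog. The function name and the sample values are my own.

```python
import math

LEAGUES_PER_DEGREE = 17.5  # assumption: 17.5 leagues to one degree of arc


def destination(lat_deg, lon_deg, bearing_deg, leagues):
    """Estimate the arrival point from a start point, a bearing and a distance."""
    lat1 = math.radians(lat_deg)
    lon1 = math.radians(lon_deg)
    bearing = math.radians(bearing_deg)
    # Leagues convert directly into an angle travelled over the sphere.
    arc = math.radians(leagues / LEAGUES_PER_DEGREE)

    lat2 = math.asin(
        math.sin(lat1) * math.cos(arc)
        + math.cos(lat1) * math.sin(arc) * math.cos(bearing)
    )
    lon2 = lon1 + math.atan2(
        math.sin(bearing) * math.sin(arc) * math.cos(lat1),
        math.cos(arc) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


# Hypothetical example: 35 leagues due north from 10 N, 100 E.
lat, lon = destination(10.0, 100.0, 0.0, 35.0)
print(round(lat, 3), round(lon, 3))   # 12.0 100.0
```

Because the league is defined here as a fraction of a degree, the radius of the Earth never enters the calculation.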
The map with comments can be found here: http://goo.gl/iqc3P