On 28/12/12 13:26, Abhishek Shivkumar wrote:
Thanks Andy. Awesome!
So, I am downloading the latest dump of Freebase RDF ->
freebase-rdf-2012-12-23-00-00.gz
Let me check with that and use tdbloader to see if it has been corrected.
Also, when will Jena 2.10.0, with this correction, be released?
Development builds are always available:
https://repository.apache.org/content/repositories/snapshots/org/apache/jena/
We don't have a date for the release but
(1) the development builds should pass all the tests - the CI Jenkins
installation will tell the dev mailing list if not!
(2) expect a test/release cycle for 2.10.0 in which we ask the community to
test things before a release, and not after, so that will take a while.
So while not as highly tested, the dev builds are generally pretty good
and contain no known issues that would block a release.
I wrote to freebase-discuss about the "13." issue, but running
sed -e 's/\.$/ ./'
over the data might be a good idea: it puts a space before the DOT that
ends each line.
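For anyone without sed to hand, here is a minimal Python sketch of the same per-line rewrite. The function name space_before_final_dot is made up for illustration, and it assumes one triple per line, each ending in a dot.

```python
import re

# Same pattern as the sed expression: a dot at the very end of the line.
_TRAILING_DOT = re.compile(r"\.$")


def space_before_final_dot(line: str) -> str:
    """Mimic sed -e 's/\\.$/ ./' on a single line (hypothetical helper)."""
    # sed sees the line without its terminator, so strip it and re-attach it.
    body = line.rstrip("\r\n")
    terminator = line[len(body):]
    return _TRAILING_DOT.sub(" .", body) + terminator


if __name__ == "__main__":
    sample = "ns:m.01gqn1 ns:base.braziliangovt.__brazilian_political_party.__number 13.\n"
    print(space_before_final_dot(sample), end="")
```

Like the sed command, this also adds a space to lines that already end in space-DOT, which is harmless.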
To load TDB, you're going to need a big RAM machine and patience.
So checking the data first, despite the delay, is going to be a good idea.
tdbloader2 is likely to be faster. There are some tuning parameters as
well - see the script for details. Adding --parallel=3 is good if your
sort(1) supports it.
I'll download the 23-12 data but I don't have access to a large machine
at the moment. I'm going to run "split -l 10000000" to make finding
errors easier.
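If split(1) isn't available, something along these lines should do the same job. It is only a sketch: split_lines, global_line_number and the chunk-NNNNN file names are invented for illustration, and it assumes the dump has already been decompressed.

```python
from pathlib import Path


def split_lines(src, out_dir, lines_per_chunk=10_000_000):
    """Write src out as consecutive chunk files of at most lines_per_chunk lines."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    try:
        with open(src, "rb") as f:  # bytes: no decoding needed just to split
            for i, line in enumerate(f):
                if i % lines_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk-{len(paths):05d}"  # hypothetical naming
                    out = open(path, "wb")
                    paths.append(path)
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return paths


def global_line_number(chunk_index, local_line, lines_per_chunk=10_000_000):
    """Map a 1-based line number inside a 0-based chunk back to the original file."""
    return chunk_index * lines_per_chunk + local_line
```

A parser error reported at line N of chunk K then corresponds to line K * 10000000 + N of the full file.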
Andy
Thank you!
With Regards,
Abhishek S
On Fri, Dec 28, 2012 at 6:32 PM, Andy Seaborne <[email protected]> wrote:
On 28/12/12 07:42, Abhishek Shivkumar wrote:
Hi Andy,
Here are the triples from the neighborhood of line 270608. I tried
finding the error but couldn't. Do you see any by chance?
I printed the line number too on the left just in case. Ex: "line num 270591-"
Not quite the right line but close ... this may be the problem:
Line:
-----------------
ns:m.01gqn1
ns:base.braziliangovt.__brazilian_political_party.__number 13.
-----------------
and the problem is the 13.
The WG spec in development has:
[21] DECIMAL ::= [+-]? [0-9]* '.' [0-9]+
so a decimal must have at least one digit after the dot, and "13." is
integer 13 followed by a DOT (which terminates the triple).
But the W3C submission has a known problem in this area:
[18] decimal ::= ('-' | '+')? ( [0-9]+ '.' [0-9]* |
'.' ([0-9])+ | ([0-9])+ )
and "13." is ambiguous. Is it 13 and a DOT, or a decimal with lexical
form "13."? The normal way to tokenize is to choose the longest
match (so ":abc" isn't ":a" then "bc") and that means you need a
space to separate the tokens '13' and DOT.
Jena 2.7.4 follows the submission, so "13." is a decimal and the
triple then still needs a trailing DOT.
In fact, using space-DOT everywhere would be very sensible.
Trailing dots on prefix names may confuse some older parsers.
Jena development (2.10.0) follows the W3C WG spec, so it's the integer
13 and a trailing DOT, and it parses.
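To make the difference concrete, here is a rough Python sketch of longest-match tokenizing under the two decimal rules quoted above. It is not how Jena's tokenizer is written; the INTEGER pattern and the helper longest_number_token are assumptions added for illustration.

```python
import re

# WG draft rule [21]: at least one digit is required after the dot.
WG_DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
# Submission rule [18]: digits after the dot are optional.
SUBMISSION_DECIMAL = re.compile(r"[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+|[0-9]+)")
# Assumed integer rule, not quoted in this thread.
INTEGER = re.compile(r"[+-]?[0-9]+")


def longest_number_token(text, decimal_rule):
    """Return (kind, lexeme) for the longest numeric token at the start of text."""
    candidates = []
    # Integer is listed first so that it wins when both rules match equally far.
    for kind, rule in (("integer", INTEGER), ("decimal", decimal_rule)):
        m = rule.match(text)
        if m:
            candidates.append((len(m.group()), kind, m.group()))
    if not candidates:
        return None
    _, kind, lexeme = max(candidates, key=lambda c: c[0])
    return kind, lexeme


if __name__ == "__main__":
    print(longest_number_token("13.", WG_DECIMAL))          # ('integer', '13')
    print(longest_number_token("13.", SUBMISSION_DECIMAL))  # ('decimal', '13.')
```

Under the WG rule the dot is left over to terminate the triple; under the submission rule the dot is swallowed by the number, which is why the old parser then wants another DOT.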
Do you have a corrected version of freebase-rdf-2012-12-09-00-00? I
downloaded it but there are other things to fix up before it gets to
that point.
Andy