Re: Unable to complete a full fetch, reason Child Error

2006-02-26 Thread Gal Nitzan
Still got the same...

I'm not sure if it is relevant to this issue but the call you added to
Fetcher.java: 

 job.setBoolean("mapred.speculative.execution", false);

Doesn't work. All task trackers still fetch together though I have only
3 sites in the fetchlist.

The task trackers fetch the same pages...

I used the latest build from Hadoop trunk.

Gal.


On Fri, 2006-02-24 at 14:15 -0800, Doug Cutting wrote:
 Mike Smith wrote:
  060219 142408 task_m_grycae  Parent died.  Exiting task_m_grycae
 
 This means the child process, executing the task, was unable to ping its 
 parent process (the task tracker).
 
  060219 142408 task_m_grycae Child Error
  java.io.IOException: Task process exit with nonzero status.
  at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144)
  at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97)
 
 And this means that the parent was really still alive, and has noticed 
 that the child killed itself.
 
 It would be good to know how the child failed to contact its parent.  We 
 should probably log a stack trace when this happens.  I just made that 
 change in Hadoop and will propagate it to Nutch.
 
 Doug
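
A minimal sketch of the behaviour Doug describes, not the Hadoop code itself: a child task pings its parent at intervals, and when a ping fails it records the stack trace (the logging change mentioned above) and gives up with a nonzero status. The names ParentLink, ParentWatchdog, watch and EXIT_PARENT_DIED are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical stand-in for the child's channel to its task tracker. */
interface ParentLink {
    void ping() throws IOException;
}

class ParentWatchdog {
    /** Hypothetical nonzero status returned when the parent is unreachable. */
    static final int EXIT_PARENT_DIED = 1;

    /**
     * Pings the parent up to maxPings times. Returns 0 if every ping succeeded,
     * or EXIT_PARENT_DIED after logging the stack trace of the first failed ping.
     */
    static int watch(ParentLink parent, int maxPings, long intervalMillis, StringBuilder log)
            throws InterruptedException {
        for (int i = 0; i < maxPings; i++) {
            try {
                parent.ping();
            } catch (IOException e) {
                // Record why the ping failed, not just that it failed.
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                log.append("Parent died.  Exiting\n").append(trace.toString());
                return EXIT_PARENT_DIED;
            }
            Thread.sleep(intervalMillis);
        }
        return 0;
    }

    public static void main(String[] args) throws InterruptedException {
        StringBuilder log = new StringBuilder();
        // Simulated parent that never answers.
        ParentLink unreachable = () -> { throw new IOException("connection refused"); };
        int status = watch(unreachable, 3, 10, log);
        System.out.print(log);
        System.out.println("child would exit with status " + status);
    }
}
```

With the trace in the log, the reason the child could not reach its parent is visible next to the "Parent died" line instead of having to be guessed.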
 




FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
I noticed that Nutch seems to have some problems parsing PDFs.
 
060226 131210 fetch okay, but can't parse
http://www.irs.gov/pub/irs-pdf/p1828.pdf, reason: failed(2,203):
Content-Type not text/html: application/pdf
 
I am actually working on PDF parsing technology, and have posted the
following message to two open source PDF projects (PDFBox and iText).  If
there is interest from Nutch developers in the responses I have
received, and in how a collaborative solution may be reached, let me know.
 
-----Original Message-----
From: Richard Braman [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 21, 2006 10:36 AM
To: 'itext-questions@lists.sourceforge.net'; '[EMAIL PROTECTED]';
'[EMAIL PROTECTED]'
Cc: '[EMAIL PROTECTED]'
Subject: Good reading/research on PDF text extraction



In 2003, Tamir Hassan wrote an open source program
(http://www.tamirhassan.com/) to extract text out of PDF tables and
columns and put it into HTML as part of a university research project.
His algorithms were actually quite sophisticated and are well documented
in http://www.tamirhassan.dsl.pipex.com/final.pdf.

The results were quite impressive, as he managed to deal with
columns, etc. using what he referred to as an "intelligent text
extraction" algorithm, which uses positions to preserve text flow.  He
used Jpedal as his underlying PDF library.
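
To illustrate the general idea of using positions to preserve text flow, here is a simplified sketch. It is not Tamir's algorithm and does not use the Jpedal API. Text fragments that carry page coordinates are grouped into lines by their y position and then read left to right within each line. The Fragment type, the tolerance value, and a y axis that grows downward are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical text fragment with the page position of its top-left corner. */
class Fragment {
    final double x, y;
    final String text;
    Fragment(double x, double y, String text) { this.x = x; this.y = y; this.text = text; }
}

class PositionalFlow {
    /**
     * Rebuilds reading order: a fragment whose y value lies within yTolerance of the
     * first fragment of the current line joins that line; each line is ordered by x.
     * Assumes y grows downward.
     */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        // Fragments arrive in content-stream order, which need not match reading order.
        page.add(new Fragment(120, 50.4, "extraction"));
        page.add(new Fragment(10, 80, "Second line"));
        page.add(new Fragment(10, 50, "Text"));
        for (String line : toLines(page, 2.0)) {
            System.out.println(line);
        }
    }
}
```

Multi-column layouts need a further step, such as splitting a line where the horizontal gap between fragments is large, which this sketch leaves out.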

Unfortunately his program was written with an old version of Jpedal and
does not run with the new Jpedal.  This is because the
PDFGenericGrouping class he used was renamed to PDFGroupingAlgorithms
and moved to the non-GPL Jpedal.  The new class also changed some of the
old class's members from public to private, and deleted some members,
which would make rewriting his app necessary.

Fast forward to 2005: Christian Leinberger, a colleague of Tamir's,
wrote a paper entitled "Ideas for extracting data from an unstructured
document"
(http://www.chilisoftware.net/Private/Christian/ideas_for_extracting_data_from_unstructured_documents.pdf).
Christian indicated that he is using the open source, BSD-licensed
PDFBox as his library for experimenting with algorithms that can be
used to extract text reliably out of unstructured PDFs.

I have contacted these guys and hopefully they will be willing to share
their developments with the PDF community.

As more and more content gets pushed into PDF, it loses its meaning to
anyone other than a human reader or a printer.  Machines do not
have the ability to read and parse it reliably in a generic context, and
it requires sophisticated AI algorithms based on ontologies, or other
big words, to get it out.  If you're lucky, you can hack through it and
get what you need. Something to think about the next time you push
content into a PDF, or even HTML.  PDF is a great way to present content
for printing, but it [EMAIL PROTECTED], pardon my French, as a primary
mechanism for presenting data that may need to be used by a machine
somewhere downstream.

Getting it out has turned into big business for companies who have
developed technology to get into the PDF and get important data out of
it and into another format, usually XML.  This is a growing space and I
hope that there are some more developers interested in solving the
problem created by PDF-crazy folks who have managed to shove valuable
data into PDF while failing to maintain that same data in another, more
usable format (e.g. XML, or at least tagged PDF).  It is best that
this is done in an open format, because the value of such technology is
very high, it is complicated to produce, and very useful to the general
public.

Richard Braman
mailto:[EMAIL PROTECTED]
561.748.4002 (voice) 

http://www.taxcodesoftware.org/
Free Open Source Tax Software

 


Release Planning

2006-02-26 Thread Nutch developer
Hello nutch people,

it seems that the next Nutch version, 0.8, will be a major redesign of
the old Nutch.
Definitely there will be some very cool features that the world is waiting
for.

Currently the other folks in my project and I are thinking about using
Nutch. The integration of Nutch into our software would start in about
two months.

My question:
Is there some release planning for Nutch? What is the estimated date for
a stable version of 0.8? Is there a chance that 0.8 will be available in
two months?

By the way:
What are the criteria for a version 1.0 of nutch?


Thanks!


RE: FW: Good reading/research on PDF text extraction

2006-02-26 Thread Richard Braman
Rakesh,
What developments have been done so far to enable Nutch to parse PDFs?
Have you read through Tamir's white paper?
Rich
 
 
 
PS. Here are some comments from Ben Litchfield, developer of the open
source PDFBox (Java), followed by some comments from Tamir, who wrote the
PDF extraction algorithm:
Richard,

Are you saying you want to head this type of project up and are looking
for help or are you requesting this functionality be added to existing
projects?

I have worked on a couple different 'custom' text extraction projects
using PDFBox and need to organize those changes before I can commit them
to the PDFBox project. Right now they are very specific/custom so I need
to extract the generic parts out and make them part of the core PDFBox.
Just need to find the time to do it.

Certainly if Christian Leinberger has made some progress I would be
willing to work with him to add some features to the PDFBox core.

I agree that this is important functionality and requires more than just
simple text extraction but advanced AI concepts.

Ben

My response:

I am requesting this functionality be added to existing projects. I am
saying I am available to code, discuss, document, test, support, or
otherwise do whatever else I can do to get some good technology in the
public domain in this area.

 Certainly if Christian Leinberger has made some progress I would be
 willing to work with him to add some features to the PDFBox core.

Hopefully they will get back to us all. I would like to see the results.

I would also like to ask Ben, et al if PDFBox supports reading of
tagged PDF, and if so in what classes? 

 

 

 

-----Original Message-----
From: Tamir Hassan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 23, 2006 5:44 AM
To: [EMAIL PROTECTED]
Subject: Re: Do you still answer this email

 

Dear Richard,

Thanks for your email.

My current situation is that I am working for a project that has a
commercial partner, who provides part of the funding. This is on the
understanding that my code and developments will eventually be
integrated with their existing commercial, non-open-source software.

So, because of this, it is not up to me to decide whether I can share
some of my developments with the rest of the PDFBox community and with a
compatible licence. I did speak to one of my supervisors today, and he
did not rule out the possibility, but this would also have to be OK'd
with several higher members of my department.

I do believe that sharing some of my progress with the community could
be mutually beneficial. Therefore, I will make a proposal to the people
in charge of the project, and I will let you know of the outcome. This
might, however take some time.

I will keep you updated.

Best regards,

Tamir

 

Richard Braman wrote:

 I read your final report, as well as Christian's report on converting
 PDF to XML. I am actually quite interested in these developments, and
 would be willing to contribute time to any projects you guys are
 undertaking. I am working on a parallel effort to convert government
 documents into structured XML. I am very interested in the technology,
 and you guys seem to have created some sophisticated content extraction
 algorithms to deal with columns, tables, etc.

 Have a look at the attached PDF. It contains columns, and text full of
 valuable information, formatted in a very unstructured way. I tried
 to run it through your code, but the file is compressed using Flate,
 and the old Jpedal couldn't understand the compression used. I tried
 running your code on the new Jpedal, and the interfaces and classes have
 changed around greatly. He in fact moved the GenericGrouping class
 into his non-GPL enterprise lib, and changed the name of the class, as
 well as the return types. He also changed some of the class members
 from public to private, and deleted others. All in all, your code
 would have to be entirely rewritten to use with the current Jpedal,
 which is a shame.

 Anyways, it seems like you are focusing on PDFBox, which has a better
 license, and developers committed to OS, instead of what Jpedal does
 now, which is keep only some stuff in GPL; everything that is seemingly
 useful is now in the enterprise library. Are you able to share your
 developments?

 





[jira] Created: (NUTCH-217) InstantiationException when deserializing Query (no parameterless constructor)

2006-02-26 Thread Dawid Weiss (JIRA)
InstantiationException when deserializing Query (no parameterless constructor)
--

 Key: NUTCH-217
 URL: http://issues.apache.org/jira/browse/NUTCH-217
 Project: Nutch
Type: Bug
  Components: searcher  
Versions: 0.8-dev
Reporter: Dawid Weiss


I've been playing with the trunk. The distributed searcher complains with an 
instantiation exception when deserializing Query. A quick code inspection shows 
that Query doesn't have any parameterless constructor.
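
A minimal sketch of how this kind of failure arises, assuming the deserializer creates the object reflectively before filling in its fields. The class names below are hypothetical stand-ins, not Nutch's Query. Class.newInstance() throws InstantiationException when the class has no parameterless constructor.

```java
/** Hypothetical class with only a one-argument constructor. */
class QueryWithoutDefault {
    final String terms;
    QueryWithoutDefault(String terms) { this.terms = terms; }
}

/** Same shape, plus the parameterless constructor that reflection needs. */
class QueryWithDefault {
    String terms = "";
    QueryWithDefault() { }
    QueryWithDefault(String terms) { this.terms = terms; }
}

class ReflectiveInstantiation {
    /** Returns "ok" if the class can be created reflectively, else the exception's simple name. */
    @SuppressWarnings("deprecation") // Class.newInstance is deprecated since Java 9
    static String tryInstantiate(Class<?> type) {
        try {
            type.newInstance();
            return "ok";
        } catch (InstantiationException | IllegalAccessException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryInstantiate(QueryWithoutDefault.class)); // prints InstantiationException
        System.out.println(tryInstantiate(QueryWithDefault.class));    // prints ok
    }
}
```

Under that assumption, adding a parameterless constructor to the deserialized class is the usual remedy.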

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira