Re: General interest question: PDF contents handling in PostgreSQL.

2002-11-27 Thread Fabián R. Breschi

Thanks a lot for the advice.

I'll let you know.

Fabian.

Perrin Harkins wrote:

 Fabián R. Breschi wrote:

I wonder whether, using ModPerl and PostgreSQL, there's any way
 to reproduce what Oracle calls 'Intermedia': in this particular
 case, parsing/indexing the content of PDF files stored inside PostgreSQL
 as a LOB, or alternatively as flat OS files with metadata parsed/indexed
 from them into the RDBMS.


 You can easily add this to DBIx::FullTextSearch.  All you need to do
 is write a simple frontend that uses a PDF reading module to extract
 the text.  However, it uses MySQL rather than PostgreSQL.

 - Perrin

Re: General interest question: PDF contents handling in PostgreSQL.

2002-11-27 Thread Fabián R. Breschi

Rob Nagler wrote:

I agree with you about the overhead caused by the Oracle solution. In particular, using intermedia with the 'Internet File System' option of 8i/9i, things get extremely complex in terms of manageability. On the other hand, the user-friendly interface that lets you drop a file into the DB and have it indexed on the fly has a high cost in terms of system

It isn't indexed on the fly in our version (8i). Has this changed? You have to run the indexer regularly, so in this it is no better than external indexing solutions. Indeed, one of the big problems is that you can't qualify the query *prior* to index search afaik. It seems to search the entire index always. In our case, this is extremely costly, because our space naturally divides, and isolated indexes would solve the problem much more efficiently.
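
To make the point about isolated indexes concrete, here is a minimal Python sketch. It is not Oracle's or intermedia's API; the class, the method names and the partition names are all hypothetical. It keeps one small inverted index per partition, so a query that is qualified by partition never touches the postings of the other partitions.

```python
from collections import defaultdict


class PartitionedIndex:
    """One inverted index per partition (hypothetical illustration)."""

    def __init__(self):
        # partition name -> word -> set of document ids
        self._indexes = defaultdict(lambda: defaultdict(set))

    def add(self, partition, doc_id, text):
        """Index the whitespace-separated words of text under one partition."""
        index = self._indexes[partition]
        for word in text.lower().split():
            index[word].add(doc_id)

    def search(self, partition, word):
        """Look the word up in the named partition's index only."""
        index = self._indexes.get(partition)
        if index is None:
            return set()
        return set(index.get(word.lower(), ()))


idx = PartitionedIndex()
idx.add("area_a", 1, "Taxes for the current year")  # made-up sample data
idx.add("area_b", 2, "Filing taxes late")
print(idx.search("area_a", "taxes"))
```

The cost of a lookup then depends on the size of one partition rather than on the whole document set, which is the efficiency gain described above.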

Oracle claims that file search within the DB takes a fraction of the time
needed for MS flat files in the IFS solution from 8i onwards (Enterprise
edition); obviously, that doesn't mean that indexing performs well compared
to an analogous solution. I didn't pay attention to the reindexing needed
after dropping a doc inside IFS, since I definitively abandoned the idea
due to performance issues. Looking back at the history, from Context
to Intermedia, the solution has now become 'UltraSearch', whose improvements
I personally have yet to get acquainted with.

resources. From my personal point of view, this particular workflow did not scale well on existing systems that had only the RDBMS installed and no spare capacity, especially in terms of CPU/memory resources.

It scales enough, if you aren't trying to solve the google problem. :-) For our users, it's ok performance, even for the heavy internal users. Just being able to search message boards and file areas (including word docs) is a huge plus for us.

I tried IFS on a system that was handling the RDBMS job well, a low-to-mid
sized/tuned configuration running Solaris 2.6 on Sun Sparc. IFS gave us the
horrible first impression of bringing the system to its knees. Frankly, I didn't
have the time/patience to find out whether there was a chance to tune it a little
more and make it scale; in my opinion it would have been a waste of time in
that particular situation without really scaling up the machine.

Following your suggestion, I could load the textual contents of the PDF, extracted with pdftotext, into a 'TEXT' column in PostgreSQL, then use a search engine to look inside it and get functionality similar to intermedia.
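
A rough Python sketch of that workflow, under several assumptions: pdftotext is installed and on the PATH, the `documents` table and the function names are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the example runs on its own. With a PostgreSQL driver such as psycopg2 the placeholders would be `%s` instead of `?`. The substring search is only a naive placeholder for a real search engine.

```python
import sqlite3
import subprocess


def extract_text(pdf_path):
    """Run the pdftotext command-line tool and return the extracted text.

    Assumes pdftotext is installed; "-" sends the text to standard output.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def store_document(conn, name, body):
    """Insert one document's text into the (hypothetical) documents table."""
    conn.execute("INSERT INTO documents (name, body) VALUES (?, ?)", (name, body))


def search_documents(conn, keyword):
    """Naive case-insensitive substring search; a real engine would use an index."""
    rows = conn.execute(
        "SELECT name FROM documents WHERE LOWER(body) LIKE ? ORDER BY name",
        ("%" + keyword.lower() + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")  # stand-in for a PostgreSQL connection
conn.execute("CREATE TABLE documents (name TEXT, body TEXT)")
# With real files: store_document(conn, "report.pdf", extract_text("report.pdf"))
store_document(conn, "report.pdf", "Quarterly tax summary")  # made-up sample text
store_document(conn, "notes.pdf", "Meeting notes")
print(search_documents(conn, "TAX"))
```

Keeping the extracted text in its own column leaves the original PDF free to live either in the database or as a flat file, with only its name stored alongside the text.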

As for the search engine, I guess it would need at least an unstructured text search algorithm, along with something like SOUNDEX in Oracle.
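
For reference, a minimal Python sketch of the classic American Soundex coding, to show what phonetic matching of this kind does. It is an illustration only; Oracle's SOUNDEX function may differ in details, and the function name here is my own.

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for letters, digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of name ("" if it has no letters)."""
    letters = [ch for ch in name.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


print(soundex("Robert"), soundex("Rupert"))
```

Names that sound alike collapse to the same code, so an index built on the codes matches "Robert" against "Rupert", which plain keyword matching would not.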

I don't think intermedia uses SOUNDEX. It does pure keyword matching. It's particularly bad in my opinion. It also doesn't learn what people really want to know. For example, if you search:

http://www.bivio.com/pub/search?s=taxes

You always get the IRS Pubs, but this is rarely what people are looking for on our site (although they should read the publications, they are more interested in what bivio can do for them in terms of taxes). Note the performance on the search. The data set you are searching in the public case is very small in comparison to the whole document database which is multi-GB.

Hope this helps.

Rob

I'm not sure what intermedia uses to search text; it certainly doesn't learn
anything about searches (I don't know what 'Ultrasearch' is capable of, despite
all the hype Oracle is putting into this technology, as usual). Regarding
the search on bivio.com, the response time is quite okay from a human point
of view, but it should probably do better considering that the indexed data
set is only 12 pages.

Thanks a lot for your valuable suggestions. I will let you know if there are
any further developments on what we've been talking about.

All the best.

Fabian.

General interest question: PDF contents handling in PostgreSQL.

2002-11-26 Thread Fabián R. Breschi

Dear Group,

   I wonder whether, using ModPerl and PostgreSQL, there's any way to
reproduce what Oracle calls 'Intermedia': in this particular case,
parsing/indexing the content of PDF files stored inside PostgreSQL as a LOB,
or alternatively as flat OS files with metadata parsed/indexed from them
into the RDBMS.

As far as I understand, this would require PostgreSQL itself to offer
functionality analogous to Oracle 8i/9i. As far as I know, this feature
is not implemented natively, but it could probably have been developed
separately as a procedural object or similar.

Perhaps something exists for ModPerl used alongside the RDBMS itself.

Any suggestion will be highly appreciated.

Many thanks indeed.

Fabian R. Breschi

Re: General interest question: PDF contents handling in PostgreSQL.

2002-11-26 Thread Rob Nagler

Fabián R. Breschi writes:
 I wonder whether, using ModPerl and PostgreSQL, there's any way to
 reproduce what Oracle calls 'Intermedia': in this particular case,
 parsing/indexing the content of PDF files stored inside PostgreSQL as a LOB,
 or alternatively as flat OS files with metadata parsed/indexed from them
 into the RDBMS.

We use Intermedia and Postgres on separate projects.  Oracle's PDF
parsing can be emulated with pdftotext.  You'll need a search engine.
Frankly, I'm not totally pleased with Intermedia.  Its indexer is
slow, and you have to re-optimize often.  This affects a bunch of
stuff related to the database, e.g., redo logs, which makes db
management more difficult.  If I had the time, I'd probably drop it. 

Rob

Re: General interest question: PDF contents handling in PostgreSQL.

2002-11-26 Thread Perrin Harkins

Fabián R. Breschi wrote:

   I wonder whether, using ModPerl and PostgreSQL, there's any way to
reproduce what Oracle calls 'Intermedia': in this particular case,
parsing/indexing the content of PDF files stored inside PostgreSQL as a LOB,
or alternatively as flat OS files with metadata parsed/indexed from them
into the RDBMS.

You can easily add this to DBIx::FullTextSearch.  All you need to do is 
write a simple frontend that uses a PDF reading module to extract the 
text.  However, it uses MySQL rather than PostgreSQL.

- Perrin
