How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
Hi,
  I am using XPath to index different parts of HTML pages into different
fields.  Now I have some pure text documents that have no HTML, so I can't use
XPath.  How do I index this pure text into different fields of the index?  How
do I make Nutch/Solr understand that these different parts belong to different
fields?  Maybe I can use existing content in the fields in my index?
Thanks.


  

Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Erick Erickson
Can you provide a few more details? You mention xpath, which leads me
to believe that you are using DIH, is that true? How are you getting
your documents to index? Parts of a filesystem?

Because it's possible to do many things. If you're using DIH against a
filesystem, you could use two fileDataSources, one that works only on files
with a particular extension (xml, say) and another that processes .txt files.

But that said, if you're trying to index just the text of a Word document,
you have to parse it quite differently than a plain text file. Take a look at
Tika.

All of which may not help you at all, because I'm guessing...

So I think a more complete problem statement would help us help you.

Best
Erick
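
The idea of handling each file type with its own importer can be sketched
outside of Solr as a simple routing step. This is only an illustration: the
route names and the extension table below are assumptions made up for the
sketch, not DIH, Nutch, or Tika identifiers.

```python
from pathlib import Path

# Hypothetical mapping from file extension to an indexing route.
# The route names are labels for this sketch only.
ROUTES = {
    ".html": "xpath",       # existing XPath-based field extraction
    ".htm": "xpath",
    ".xml": "xpath",
    ".txt": "plain_text",   # needs its own splitting rules
    ".doc": "tika",         # binary format, parse with Tika
}


def choose_route(filename: str) -> str:
    """Return the indexing route for a file, based only on its extension."""
    suffix = Path(filename).suffix.lower()
    return ROUTES.get(suffix, "skip")


def group_by_route(filenames):
    """Group file names by route so each group can go to its own importer."""
    groups = {}
    for name in filenames:
        groups.setdefault(choose_route(name), []).append(name)
    return groups
```

Each group can then be fed to a separate import configuration, which mirrors
the suggestion of one data source per extension.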






Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Savannah Beckett
No, I am using XPath for HTML; that is not the question.  I am indexing pure
text in addition to the HTML that I was already indexing.  Pure text like a TXT
file or a Microsoft Word doc.  There is no XPath for TXT, so how do I index a
TXT file into different fields in my index, the way I use XPath to index HTML
into different fields?

My question refers to pure text like .txt files and Microsoft Word documents,
not HTML.  I am completely fine with HTML.
Thanks.











  

Re: How to Index Pure Text into Separate Fields?

2010-09-29 Thread Lance Norskog
Simple text .txt files and MS Office .doc files are very, very different beasts.
You can handle simple .txt files with a few more lines in your
DataImportHandler script.
With DOC files it is easiest to use the extracting request handler
*/extract. This is on the wiki.
If you want to do this inside the DataImportHandler, you need to use
3.x or the trunk. And it has bugs.











-- 
Lance Norskog
goks...@gmail.com
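
The two routes mentioned in the last reply (extra parsing for .txt files, the
extracting request handler for .doc files) might look roughly like the sketch
below. Several things here are assumptions, since the thread never says how
the text files are laid out: the "Name: value" header convention, the
`content` field name, and the helper names `split_text_fields` and
`build_extract_url` are all hypothetical. The handler path `/update/extract`
is the one used in Solr's example configuration and may differ in your
solrconfig.xml; `literal.<field>` and `commit` are request parameters of that
handler.

```python
import re
from urllib.parse import urlencode

# Assumed layout of a .txt file (not stated in the thread):
# optional "Name: value" header lines, then the body text.
HEADER_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9_]*):\s*(.*)$")


def split_text_fields(text: str) -> dict:
    """Split plain text into a dict of field name -> value."""
    fields = {}
    lines = text.splitlines()
    body_start = 0
    for i, line in enumerate(lines):
        match = HEADER_LINE.match(line)
        if not match:
            # The first non-header line ends the header block.
            body_start = i
            break
        fields[match.group(1).lower()] = match.group(2).strip()
        body_start = i + 1
    # Everything after the headers goes into one body field.
    fields["content"] = "\n".join(lines[body_start:]).strip()
    return fields


def build_extract_url(solr_base: str, doc_id: str, literals: dict) -> str:
    """Build a URL for Solr's extracting request handler.

    Assumes the handler is registered at /update/extract.
    """
    params = {"literal.id": doc_id, "commit": "true"}
    for name, value in literals.items():
        # literal.<field> sets a field value alongside the extracted text.
        params["literal." + name] = value
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)
```

The dict returned by `split_text_fields` can be sent to Solr as an ordinary
document, while a .doc file would be POSTed to the URL from
`build_extract_url` so that Tika extracts the text on the server side.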