Re: Hive MS SQL Server

2011-03-30 Thread shared mailinglists
Thanks Appan, the goal was just to have the metadata backing Hive in
SQL Server, not the Hadoop data itself. Our DBAs monitored the SQL
generated by DataNucleus against SQL Server and were typically none too happy
:-) Hive & SQL Server is therefore a no-go for us at the moment, so we're
looking at alternative approaches.

Many thanks,

Andy,

On 28 March 2011 17:16, Appan Thirumaligai athirumali...@ngmoco.com wrote:

 Andy,

 Here is a short-term workaround - if your goal is to back up data in SQL
 Server, just copy the data out of Hadoop and import it into SQL (DTS /
 BCP / SSIS, based on which version you are on). Email Viral (I guess he is
 still active on the mailing list) and I'm sure he is using Hive, Hadoop & SQL
 Server.
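
 A rough sketch of that copy (paths, table, and server names are hypothetical;
 bcp expects a character-mode file with matching delimiters, so export the data
 comma-delimited):

   hadoop fs -getmerge /user/hive/warehouse/mytable /tmp/mytable.csv
   bcp MyDb.dbo.hive_backup in /tmp/mytable.csv -S sqlserver-host -U backup_user -P secret -c -t ','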

 Thanks,
 Appan

 On Mar 25, 2011, at 1:59 AM, shared mailinglists wrote:

 Good Morning,

 Our DBAs created a new schema associated with the database and then made
 that the default schema for our Hive user; unfortunately this resulted in
 the same problem in the logs…


 “Check of existence of COLUMNS returned table type of VIEW”

 … in that Hive still sees the default SQL Server COLUMNS view and
 therefore does not create its own COLUMNS table.


 Is there any way we can configure Hive to use a different table name, or are
 there any other approaches we could try?


 Many thanks,


 Andy.


 On 24 March 2011 17:23, shared mailinglists shared.mailingli...@gmail.com wrote:

 Hi Carl,

 Many thanks for your suggestions. I will put these to our DBAs and see if
 we can disable the default schema :-) Will post back soon.

 Cheers & thanks for the rapid replies, guys,

 Andy.


 On 24 March 2011 17:12, Carl Steinbach c...@cloudera.com wrote:

 Hi Andy,

 From what I understand, SQL Server has the notion of a default schema
 (usually dbo) which is used to resolve identifiers that are not defined in a
 user's current schema. I think you need to either undefine the default
 schema for your metastore user account, or else make it point to the
 metastore schema.
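
 For example, a minimal T-SQL sketch (user and schema names are hypothetical):

   -- point the metastore login's default schema at the Hive metastore schema
   ALTER USER hive_metastore_user WITH DEFAULT_SCHEMA = hive_meta;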

 Here are some relevant links with more information:

 http://msdn.microsoft.com/en-us/library/ms190387.aspx

 http://stackoverflow.com/questions/3806245/sql-server-schema-and-default-schema

 http://dba.fyicenter.com/faq/sql_server_2/Default_Schema_of_Your_Login_Session.html

 Hope this helps.

 Carl


 On Thu, Mar 24, 2011 at 9:26 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Thu, Mar 24, 2011 at 11:36 AM, shared mailinglists
 shared.mailingli...@gmail.com wrote:
  Thanks Bernie, hopefully they will.
 
  We're a small Java development team within a predominantly MS development
  house. We're hopefully introducing new ideas, but the normal company politics
  dictate that we should use SQL Server. That way maintenance, backup, recovery,
  etc. can be handed over to the internal MS DB team while freeing us
  to concentrate on better things like Hadoop & Hive :-) I assumed that with the
  DB just being a metadata store the database wouldn't be an issue, but we're
  struggling a bit :-(
 
  On 24 March 2011 15:23, Bennie Schut bsc...@ebuddy.com wrote:
 
  Sorry to become a bit off-topic, but how do you get into a situation where
  SQL Server 2005 becomes a requirement for a Hive internal metastore?
 
  I doubt many of the developers of Hive will have access to this database,
  so I don't expect a lot of response on this. But hopefully someone can prove
  me wrong :)
 
  Bennie.
 
 
  On 03/24/2011 04:01 PM, shared mailinglists wrote:
 
  Hi Hive users :-)
 
  Does anybody have experience of using Hive with MS SQL Server 2005? I’m
  currently stumped with the following issue
  https://issues.apache.org/jira/browse/HIVE-1391 where Hive (or DataNucleus?)
  confuses the COLUMNS table it requires internally with the default
  SQL Server sys.COLUMNS or information_schema.COLUMNS view, and therefore does
  not automatically create the required metadata table when running the Hive
  CLI.
 
 
  Has anybody managed to get Hive to work with SQL Server 2005, or does anyone
  know how I can configure Hive to use a different table name than COLUMNS?
  Unfortunately we have to use SQL Server and do not have the option to use
  Derby or MySQL etc.
 
  Many thanks,
 
 
  Andy.
 
 
 
 

 Let us not forget that M$ SQL Server is very advanced. It has for a
 long time supported many things that MySQL just plain did
 not. (Did we all forget the MySQL 3.x days, when we had no
 transactions or foreign keys? :)

 There was one ticket I closed on it:
 https://issues.apache.org/jira/browse/HIVE-1391

 As far as Hive is concerned, M$ SQL Server is JPOX/DataNucleus
 supported, so it should work. How many deployments exist in the wild
 is unknown.
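
 For reference, pointing the metastore at SQL Server comes down to the JDO
 connection properties in hive-site.xml (a sketch; host, port, and database
 name are hypothetical, and the Microsoft JDBC driver jar must be on Hive's
 classpath):

   <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:sqlserver://dbhost:1433;databaseName=hivemeta</value>
   </property>
   <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.microsoft.sqlserver.jdbc.SQLServerDriver</value>
   </property>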








Hive in a readonly mode

2011-03-30 Thread Guy Doulberg
Hey all,
I bet someone has already asked this question before, but I couldn't find a thread
with an answer to it.

I want to give analysts in my organization access to Hive in a read-only way,
i.e., I don't want them to be able to create or drop tables, alter tables, insert,
or load.

How can I do that?

Thanks, anyway





insert - Hadoop vs. Hive

2011-03-30 Thread David Zonsheine
Hi,

I'm trying to compare adding files to HDFS for Hive usage via Hive INSERTs
vs. adding them to HDFS directly and then using Hive.

Any comments or blog posts about this?

Thanks a lot,

David Zonsheine


Re: Hive in a readonly mode

2011-03-30 Thread Edward Capriolo
On Wed, Mar 30, 2011 at 9:29 AM, Guy Doulberg guy.doulb...@conduit.com wrote:
 Hey all,

 I bet someone has already asked this question before, but I couldn't find a
 thread with an answer to it.

 I want to give analysts in my organization access to Hive in a read-only way,
 i.e., I don't want them to be able to create or drop tables, alter tables,
 insert, or load.



 How can I do that?



 Thanks, anyway







Hive 0.7.0 has GRANT/REVOKE syntax.

If you are looking for a simple solution in a pre-Hadoop-security
world: recursively own /user/hive/warehouse by a user different than
the user running queries, and enable dfs.permissions. This should
prevent anyone from writing to the files, although jobs running and
failing will be less than optimal. After a few failures, hopefully
users will get the point.
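
A sketch of both approaches (user, group, and table names are hypothetical; the
GRANT line assumes Hive 0.7.0 with authorization enabled):

  # lock the warehouse down to a dedicated user, leaving world read-only access
  hadoop fs -chown -R hive:hive /user/hive/warehouse
  hadoop fs -chmod -R 755 /user/hive/warehouse

  -- Hive 0.7.0: per-table read access for an analyst
  GRANT SELECT ON TABLE web_logs TO USER analyst1;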


figuring out the right setting for dfs.datanode.max.xcievers

2011-03-30 Thread Igor Tatarinov
I haven't found a good description of this setting or of the cost of setting
it too high. Hope somebody can explain.

I have about a year's worth of data partitioned by date. Using 10 nodes and
setting xcievers to 5000, I can only save into 100 or so partitions. As a
result, I have to do 4 rounds of saving data into the underlying partitioned
table (in S3). That's pretty slow.

Should I just set xcievers to 1M, or will Hadoop crash as a result? Is each
xciever really a separate thread?
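
For reference, this setting lives in hdfs-site.xml, and the misspelling is part
of the property name. A minimal snippet (4096 is a commonly used value, not a
recommendation for any particular cluster):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>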

When will the spelling be corrected? :)

Thanks a bunch!


how to convert single line into multiple lines in a serde (txt in txt out)?

2011-03-30 Thread Michael Jiang
I want to extend RegexSerDe to parse Apache web logs: each log entry needs to
be converted into multiple entries. This is easy in streaming, but I'm new to
SerDes and wondering if it is doable, and how? Thanks!


Re: how to convert single line into multiple lines in a serde (txt in txt out)?

2011-03-30 Thread Edward Capriolo
On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang it.mjji...@gmail.com wrote:
 I want to extend RegexSerDe to parse Apache web logs: each log entry needs to
 be converted into multiple entries. This is easy in streaming, but I'm new to
 SerDes and wondering if it is doable, and how? Thanks!


You can have your serde produce list<struct> and then explode() them.


Re: insert - Hadoop vs. Hive

2011-03-30 Thread Ashish Thusoo
If the data is already in the right format you should use the LOAD syntax in Hive.
This basically moves files into HDFS (so it should be no less performant than a
direct HDFS copy). If the data is not in the correct format or needs to be
transformed, then the INSERT statement needs to be used.
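
A sketch of the two paths (table, column, and path names are hypothetical):

  -- data already in the right format: essentially a metadata-only file move
  LOAD DATA INPATH '/staging/logs/2011-03-30'
  INTO TABLE raw_logs PARTITION (ds='2011-03-30');

  -- data needs transformation: runs a MapReduce job to rewrite it
  INSERT OVERWRITE TABLE clean_logs PARTITION (ds='2011-03-30')
  SELECT host, request FROM raw_logs WHERE ds = '2011-03-30';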

Ashish
On Mar 30, 2011, at 6:30 AM, David Zonsheine wrote:

 
 Hi,
 
 I'm trying to compare adding files to HDFS for Hive usage via Hive INSERTs
 vs. adding them to HDFS directly and then using Hive.
 
 Any comments or blog posts about this?
 
 Thanks a lot,
 
 David Zonsheine
 



Re: how to convert single line into multiple lines in a serde (txt in txt out)?

2011-03-30 Thread Michael Jiang
Thanks Edward. You mean implement deserialize to return a list<struct>?
What is explode()? Sorry for the basic questions. Could you please elaborate
on this a bit more or give me a link to some reference? Thanks!

On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang it.mjji...@gmail.com wrote:
  I want to extend RegexSerDe to parse Apache web logs: each log entry needs to
  be converted into multiple entries. This is easy in streaming, but I'm new to
  SerDes and wondering if it is doable, and how? Thanks!
 

 You can have your serde produce list<struct> and then explode() them.



Re: INSERT OVERWRITE LOCAL DIRECTORY -- Why it creates multiple files

2011-03-30 Thread V.Senthil Kumar
Thanks for the suggestion. The query created just one result file.

Also, before trying this query, I had found another way of making this
work. I added the following properties to hive-site.xml and it worked as
well. It created just one result file.


<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value>
  <description>Merge small files at the end of a map-reduce job</description>
</property>

<property>
  <name>hive.input.format</name>
  <value>org.apache.hadoop.hive.ql.io.CombineHiveInputFormat</value>
  <description>The default input format, if it is not specified, the system
assigns it. It is set to HiveInputFormat for hadoop versions 17, 18 and 19,
whereas it is set to CombineHiveInputFormat for hadoop 20. The user can always
overwrite it - if there is a bug in CombineHiveInputFormat, it can always be
manually set to HiveInputFormat.</description>
</property>



- Original Message 
From: Jov zhao6...@gmail.com
To: user@hive.apache.org
Sent: Tue, March 29, 2011 10:22:32 PM
Subject: Re: INSERT OVERWRITE LOCAL DIRECTORY -- Why it creates multiple files

Try adding a limit:

INSERT OVERWRITE LOCAL DIRECTORY
'/home/hdp-user/hiveadmin_dirs/outbox/apachetest'
Select host, identity, user, time, request
from raw_apachelog
where ds = '2011-03-22-001500' limit 32;


2011/3/30 V.Senthil Kumar vaisen2...@yahoo.com:
 Hello,

 I have a Hive query which does a simple select and writes the results to a
 local file system.


 For example, a query like this,

 INSERT OVERWRITE LOCAL DIRECTORY
 '/home/hdp-user/hiveadmin_dirs/outbox/apachetest'
 Select host, identity, user, time, request
 from raw_apachelog
 where ds = '2011-03-22-001500';

 Now this creates two files under the apachetest folder. This table has only 32
 rows. Is there any way I can make Hive create only a single file?


 Appreciate your help :)

 Thanks,
 Senthil




Re: how to convert single line into multiple lines in a serde (txt in txt out)?

2011-03-30 Thread Michael Jiang
Thanks Edward. That'll work.

But that also means two tables will be created. What if we only want one
table, using some serde such that it reads the Apache web log and generates
multiple rows for each line of entry in the log, which get loaded into the
target table that I want? Is it doable by customizing RegexSerDe? I.e., create
external table A (...) row format serde 'serdeclass' with serdeproperties (...)
stored as textfile location 'pathtoapachelog'; would give you the table with
the right fields extracted and multiple rows generated for querying later.

If I cannot create such a serde without creating a second table for the task, I
think streaming is a better choice from a source-code-management aspect: using
a serde requires you to manage more libraries (hadoop, hive ...) for the build.

Thanks!

On Wed, Mar 30, 2011 at 1:16 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Wed, Mar 30, 2011 at 3:46 PM, Michael Jiang it.mjji...@gmail.com wrote:
  Also, what if I want just one step to load each log entry line from the log
  file and generate multiple lines for each? That is, just one table created. I
  don't want to have one table and then call explode() to get multiple lines.
  Otherwise, the alternative is to use streaming on the loaded table to turn it
  into another one, with no need to customize a serde. So, yeah, the goal here
  is to see how a serde can do this stuff.

  Thanks!
 
  On Wed, Mar 30, 2011 at 12:03 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 
   On Wed, Mar 30, 2011 at 2:55 PM, Michael Jiang it.mjji...@gmail.com wrote:
    I want to extend RegexSerDe to parse Apache web logs: each log entry needs
    to be converted into multiple entries. This is easy in streaming, but I'm
    new to SerDes and wondering if it is doable, and how? Thanks!
  
 
  You can have your serde produce list<struct> and then explode() them.
 
 

 The role of a SerDe is to take the output from the InputFormat and use
 the information inside the metastore to decode it. As a result, it is
 not a good place to turn a single row into multiple rows.

 What I am suggesting is to define a column like this:

 create table ... (id int, log_entries list<string>) row format serde ...

 Make sure your serde decodes and populates log_entries.

 From there you can use lateral view and explode
 http://wiki.apache.org/hadoop/Hive/LanguageManual/LateralView to turn
 the list<string> into rows.
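
 A sketch of the whole flow (table, column, and SerDe class names are
 hypothetical; note that Hive's DDL spells the list type ARRAY<STRING>):

   CREATE TABLE weblog (id INT, log_entries ARRAY<STRING>)
   ROW FORMAT SERDE 'com.example.hive.WebLogSerDe';

   -- one output row per element of log_entries
   SELECT id, entry
   FROM weblog
   LATERAL VIEW explode(log_entries) entries AS entry;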


 Edward